Preface



Taken literally, the title “All of Statistics” is an exaggeration. But in spirit, 
the title is apt, as the book does cover a much broader range of topics than a 
typical introductory book on mathematical statistics. 

This book is for people who want to learn probability and statistics quickly. 
It is suitable for graduate or advanced undergraduate students in computer 
science, mathematics, statistics, and related disciplines. The book includes 
modern topics like nonparametric curve estimation, bootstrapping, and clas- 
sification, topics that are usually relegated to follow-up courses. The reader is 
presumed to know calculus and a little linear algebra. No previous knowledge 
of probability and statistics is required. 

Statistics, data mining, and machine learning are all concerned with 
collecting and analyzing data. For some time, statistics research was con- 
ducted in statistics departments while data mining and machine learning re- 
search was conducted in computer science departments. Statisticians thought 
that computer scientists were reinventing the wheel. Computer scientists 
thought that statistical theory didn’t apply to their problems. 

Things are changing. Statisticians now recognize that computer scientists 
are making novel contributions while computer scientists now recognize the 
generality of statistical theory and methodology. Clever data mining algo- 
rithms are more scalable than statisticians ever thought possible. Formal sta- 
tistical theory is more pervasive than computer scientists had realized. 

Students who analyze data, or who aspire to develop new methods for 
analyzing data, should be well grounded in basic probability and mathematical 
statistics. Using fancy tools like neural nets, boosting, and support vector 
machines without understanding basic statistics is like doing brain surgery
before knowing how to use a band-aid. 

But where can students learn basic probability and statistics quickly? Nowhere. 
At least, that was my conclusion when my computer science colleagues kept 
asking me: “Where can I send my students to get a good understanding of 
modern statistics quickly?” The typical mathematical statistics course spends 
too much time on tedious and uninspiring topics (counting methods, two di- 
mensional integrals, etc.) at the expense of covering modern concepts (boot- 
strapping, curve estimation, graphical models, etc.). So I set out to redesign 
our undergraduate honors course on probability and mathematical statistics. 
This book arose from that course. Here is a summary of the main features of 
this book. 

1. The book is suitable for graduate students in computer science and 
honors undergraduates in math, statistics, and computer science. It is 
also useful for students beginning graduate work in statistics who need 
to fill in their background on mathematical statistics. 

2. I cover advanced topics that are traditionally not taught in a first course. 
For example, nonparametric regression, bootstrapping, density estima- 
tion, and graphical models. 

3. I have omitted topics in probability that do not play a central role in 
statistical inference. For example, counting methods are virtually ab- 
sent. 

4. Whenever possible, I avoid tedious calculations in favor of emphasizing 
concepts. 

5. I cover nonparametric inference before parametric inference. 

6. I abandon the usual “First Term = Probability” and “Second Term 
= Statistics” approach. Some students only take the first half and it 
would be a crime if they did not see any statistical theory. Furthermore, 
probability is more engaging when students can see it put to work in the 
context of statistics. An exception is the topic of stochastic processes 
which is included in the later material. 

7. The course moves very quickly and covers much material. My colleagues 
joke that I cover all of statistics in this course and hence the title. The 
course is demanding but I have worked hard to make the material as 
intuitive as possible so that the material is very understandable despite 
the fast pace. 

8. Rigor and clarity are not synonymous. I have tried to strike a good 
balance. To avoid getting bogged down in uninteresting technical details, 
many results are stated without proof. The bibliographic references at 
the end of each chapter point the student to appropriate sources. 
[FIGURE 1. Probability and inference. Figure labels: Probability; Inference and Data Mining.]



9. On my website are files with R code which students can use for doing 
all the computing. The website is: 

http://www.stat.cmu.edu/~larry/all-of-statistics

However, the book is not tied to R and any computing language can be 
used. 

Part I of the text is concerned with probability theory, the formal language 
of uncertainty which is the basis of statistical inference. The basic problem 
that we study in probability is: 

Given a data generating process, what are the properties of the out- 
comes? 

Part II is about statistical inference and its close cousins, data mining and 
machine learning. The basic problem of statistical inference is the inverse of 
probability: 

Given the outcomes, what can we say about the process that gener- 
ated the data? 

These ideas are illustrated in Figure 1. Prediction, classification, clustering, 
and estimation are all special cases of statistical inference. Data analysis, 
machine learning and data mining are various names given to the practice of 
statistical inference, depending on the context. 
Part III applies the ideas from Part II to specific problems such as regression,
graphical models, causation, density estimation, smoothing, classification,
and simulation. Part III contains one more chapter on probability that
covers stochastic processes including Markov chains.

I have drawn on other books in many places. Most chapters contain a section 
called Bibliographic Remarks which serves both to acknowledge my debt to 
other authors and to point readers to other useful references. I would especially 
like to mention the books by DeGroot and Schervish (2002) and Grimmett 
and Stirzaker (1982) from which I adapted many examples and exercises. 

As one develops a book over several years it is easy to lose track of where pre- 
sentation ideas and, especially, homework problems originated. Some I made 
up. Some I remembered from my education. Some I borrowed from other 
books. I hope I do not offend anyone if I have used a problem from their book 
and failed to give proper credit. As my colleague Mark Schervish wrote in his 
book (Schervish (1995)), 

“. . . the problems at the ends of each chapter have come from many 
sources. . . . These problems, in turn, came from various sources 
unknown to me ... If I have used a problem without giving proper 
credit, please take it as a compliment.” 

I am indebted to many people without whose help I could not have written 
this book. First and foremost, the many students who used earlier versions 
of this text and provided much feedback. In particular, Liz Prather and Jen- 
nifer Bakal read the book carefully. Rob Reeder valiantly read through the 
entire book in excruciating detail and gave me countless suggestions for im- 
provements. Chris Genovese deserves special mention. He not only provided 
helpful ideas about intellectual content, but also spent many, many hours 
writing LaTeX code for the book. The best aspects of the book’s layout are due
to his hard work; any stylistic deficiencies are due to my lack of expertise. 
David Hand, Sam Roweis, and David Scott read the book very carefully and 
made numerous suggestions that greatly improved the book. John Lafferty 
and Peter Spirtes also provided helpful feedback. John Kimmel has been sup- 
portive and helpful throughout the writing process. Finally, my wife Isabella 
Verdinelli has been an invaluable source of love, support, and inspiration. 

Larry Wasserman 
Pittsburgh, Pennsylvania 
July 2003 
Statistics/Data Mining Dictionary

Statisticians and computer scientists often use different language for the
same thing. Here is a dictionary that the reader may want to return to
throughout the course.

Statistics              Computer Science       Meaning
estimation              learning               using data to estimate an unknown quantity
classification          supervised learning    predicting a discrete $Y$ from $X$
clustering              unsupervised learning  putting data into groups
data                    training sample        $(X_1, Y_1), \ldots, (X_n, Y_n)$
covariates              features               the $X_i$'s
classifier              hypothesis             a map from covariates to outcomes
hypothesis              -                      subset of a parameter space $\Theta$
confidence interval     -                      interval that contains an unknown quantity with given frequency
directed acyclic graph  Bayes net              multivariate distribution with given conditional independence relations
Bayesian inference      Bayesian inference     statistical methods for using data to update beliefs
frequentist inference   -                      statistical methods with guaranteed frequency behavior
large deviation bounds  PAC learning           uniform bounds on probability of errors
1

Probability



1.1 Introduction 

Probability is a mathematical language for quantifying uncertainty. In this 
Chapter we introduce the basic concepts underlying probability theory. We 
begin with the sample space, which is the set of possible outcomes. 



1.2 Sample Spaces and Events 

The sample space $\Omega$ is the set of possible outcomes of an experiment. Points
$\omega$ in $\Omega$ are called sample outcomes, realizations, or elements. Subsets of
$\Omega$ are called events.

1.1 Example. If we toss a coin twice then $\Omega = \{HH, HT, TH, TT\}$. The event
that the first toss is heads is $A = \{HH, HT\}$. ■

1.2 Example. Let $\omega$ be the outcome of a measurement of some physical quantity,
for example, temperature. Then $\Omega = \mathbb{R} = (-\infty, \infty)$. One could argue that
taking $\Omega = \mathbb{R}$ is not accurate since temperature has a lower bound. But there
is usually no harm in taking the sample space to be larger than needed. The
event that the measurement is larger than 10 but less than or equal to 23 is
$A = (10, 23]$. ■



1.3 Example. If we toss a coin forever, then the sample space is the infinite
set

$$\Omega = \{\omega = (\omega_1, \omega_2, \omega_3, \ldots) : \omega_i \in \{H, T\}\}.$$

Let $E$ be the event that the first head appears on the third toss. Then

$$E = \{(\omega_1, \omega_2, \omega_3, \ldots) : \omega_1 = T, \omega_2 = T, \omega_3 = H, \omega_i \in \{H, T\} \text{ for } i > 3\}.$$ ■

Given an event $A$, let $A^c = \{\omega \in \Omega : \omega \notin A\}$ denote the complement of
$A$. Informally, $A^c$ can be read as “not $A$.” The complement of $\Omega$ is the empty
set $\emptyset$. The union of events $A$ and $B$ is defined

$$A \cup B = \{\omega \in \Omega : \omega \in A \text{ or } \omega \in B \text{ or } \omega \in \text{both}\}$$

which can be thought of as “$A$ or $B$.” If $A_1, A_2, \ldots$ is a sequence of sets then

$$\bigcup_{i=1}^{\infty} A_i = \{\omega \in \Omega : \omega \in A_i \text{ for at least one } i\}.$$

The intersection of $A$ and $B$ is

$$A \cap B = \{\omega \in \Omega : \omega \in A \text{ and } \omega \in B\}$$

read “$A$ and $B$.” Sometimes we write $A \cap B$ as $AB$ or $(A, B)$. If $A_1, A_2, \ldots$ is
a sequence of sets then

$$\bigcap_{i=1}^{\infty} A_i = \{\omega \in \Omega : \omega \in A_i \text{ for all } i\}.$$

The set difference is defined by $A - B = \{\omega : \omega \in A, \omega \notin B\}$. If every element
of $A$ is also contained in $B$ we write $A \subset B$ or, equivalently, $B \supset A$. If $A$ is a
finite set, let $|A|$ denote the number of elements in $A$. See the following table
for a summary.





Summary of Terminology

$\Omega$                 sample space
$\omega$                 outcome (point or element)
$A$                      event (subset of $\Omega$)
$A^c$                    complement of $A$ (not $A$)
$A \cup B$               union ($A$ or $B$)
$A \cap B$ or $AB$       intersection ($A$ and $B$)
$A - B$                  set difference ($\omega$ in $A$ but not in $B$)
$A \subset B$            set inclusion
$\emptyset$              null event (always false)
$\Omega$                 true event (always true)
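
The operations in the table map directly onto set operations in most programming languages. The following is a minimal sketch in Python; the sample space (one roll of a die) and the events `A` and `B` are illustrative choices, not taken from the text above.

```python
# Sample space for one roll of a die (an illustrative assumption).
omega = frozenset(range(1, 7))
A = frozenset({2, 4, 6})       # event: the roll is even
B = frozenset({1, 2, 3, 4})    # event: the roll is at most 4

complement_A = omega - A       # A^c: outcomes in omega that are not in A
union = A | B                  # A ∪ B: outcomes in A or B or both
intersection = A & B           # A ∩ B: outcomes in both A and B
difference = A - B             # A − B: outcomes in A but not in B
A_is_event = A <= omega        # A ⊂ Ω: every element of A lies in omega

print(sorted(complement_A), sorted(union), sorted(intersection),
      sorted(difference), A_is_event, len(A))
```

Here `len(A)` plays the role of $|A|$, the number of elements in a finite set.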
We say that $A_1, A_2, \ldots$ are disjoint or are mutually exclusive if $A_i \cap A_j = \emptyset$
whenever $i \neq j$. For example, $A_1 = [0, 1), A_2 = [1, 2), A_3 = [2, 3), \ldots$ are
disjoint. A partition of $\Omega$ is a sequence of disjoint sets $A_1, A_2, \ldots$ such that
$\bigcup_{i=1}^{\infty} A_i = \Omega$. Given an event $A$, define the indicator function of $A$ by

$$I_A(\omega) = I(\omega \in A) = \begin{cases} 1 & \text{if } \omega \in A \\ 0 & \text{if } \omega \notin A. \end{cases}$$

A sequence of sets $A_1, A_2, \ldots$ is monotone increasing if $A_1 \subset A_2 \subset \cdots$
and we define $\lim_{n \to \infty} A_n = \bigcup_{i=1}^{\infty} A_i$. A sequence of sets $A_1, A_2, \ldots$ is
monotone decreasing if $A_1 \supset A_2 \supset \cdots$ and then we define $\lim_{n \to \infty} A_n =
\bigcap_{i=1}^{\infty} A_i$. In either case, we will write $A_n \to A$.

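1.4 Example. Let $\Omega = \mathbb{R}$ and let $A_i = [0, 1/i)$ for $i = 1, 2, \ldots$. Then
$\bigcup_{i=1}^{\infty} A_i = [0, 1)$ and $\bigcap_{i=1}^{\infty} A_i = \{0\}$. If instead we define
$A_i = (0, 1/i)$ then $\bigcup_{i=1}^{\infty} A_i = (0, 1)$ and $\bigcap_{i=1}^{\infty} A_i = \emptyset$. ■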



1.3 Probability 

We will assign a real number $\mathbb{P}(A)$ to every event $A$, called the probability of
$A$.¹ We also call $\mathbb{P}$ a probability distribution or a probability measure.
To qualify as a probability, $\mathbb{P}$ must satisfy three axioms:

1.5 Definition. A function $\mathbb{P}$ that assigns a real number $\mathbb{P}(A)$ to each
event $A$ is a probability distribution or a probability measure if it
satisfies the following three axioms:

Axiom 1: $\mathbb{P}(A) \geq 0$ for every $A$
Axiom 2: $\mathbb{P}(\Omega) = 1$
Axiom 3: If $A_1, A_2, \ldots$ are disjoint then

$$\mathbb{P}\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mathbb{P}(A_i).$$

¹It is not always possible to assign a probability to every event $A$ if the sample space is large,
such as the whole real line. Instead, we assign probabilities to a limited class of sets called a
$\sigma$-field. See the appendix for details.
There are many interpretations of $\mathbb{P}(A)$. The two common interpretations
are frequencies and degrees of beliefs. In the frequency interpretation, $\mathbb{P}(A)$
is the long run proportion of times that $A$ is true in repetitions. For example,
if we say that the probability of heads is 1/2, we mean that if we flip the
coin many times then the proportion of times we get heads tends to 1/2 as
the number of tosses increases. An infinitely long, unpredictable sequence of
tosses whose limiting proportion tends to a constant is an idealization, much
like the idea of a straight line in geometry. The degree-of-belief interpretation
is that $\mathbb{P}(A)$ measures an observer’s strength of belief that $A$ is true. In either
interpretation, we require that Axioms 1 to 3 hold. The difference in
interpretation will not matter much until we deal with statistical inference. There,
the differing interpretations lead to two schools of inference: the frequentist
and the Bayesian schools. We defer discussion until Chapter 11.

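One can derive many properties of $\mathbb{P}$ from the axioms, such as:

$$\begin{aligned}
\mathbb{P}(\emptyset) &= 0 \\
A \subset B &\implies \mathbb{P}(A) \leq \mathbb{P}(B) \\
0 \leq \mathbb{P}(A) &\leq 1 \\
\mathbb{P}(A^c) &= 1 - \mathbb{P}(A) \\
A \cap B = \emptyset &\implies \mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B). \qquad (1.1)
\end{aligned}$$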

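A less obvious property is given in the following Lemma.

1.6 Lemma. For any events $A$ and $B$,

$$\mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B) - \mathbb{P}(AB).$$

Proof. Write $A \cup B = (AB^c) \cup (AB) \cup (A^cB)$ and note that these events
are disjoint. Hence, making repeated use of the fact that $\mathbb{P}$ is additive for
disjoint events, we see that

$$\begin{aligned}
\mathbb{P}(A \cup B) &= \mathbb{P}\left((AB^c) \cup (AB) \cup (A^cB)\right) \\
&= \mathbb{P}(AB^c) + \mathbb{P}(AB) + \mathbb{P}(A^cB) \\
&= \mathbb{P}(AB^c) + \mathbb{P}(AB) + \mathbb{P}(A^cB) + \mathbb{P}(AB) - \mathbb{P}(AB) \\
&= \mathbb{P}\left((AB^c) \cup (AB)\right) + \mathbb{P}\left((A^cB) \cup (AB)\right) - \mathbb{P}(AB) \\
&= \mathbb{P}(A) + \mathbb{P}(B) - \mathbb{P}(AB). \qquad ■
\end{aligned}$$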

1.7 Example. Two coin tosses. Let $H_1$ be the event that heads occurs on
toss 1 and let $H_2$ be the event that heads occurs on toss 2. If all outcomes are
equally likely, then $\mathbb{P}(H_1 \cup H_2) = \mathbb{P}(H_1) + \mathbb{P}(H_2) - \mathbb{P}(H_1 H_2) = \frac{1}{2} + \frac{1}{2} - \frac{1}{4} = 3/4$. ■
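
Lemma 1.6 can be checked on the two-toss example by enumerating the sample space. The following is a small sketch, assuming all four outcomes are equally likely; the helper name `prob` is hypothetical and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

# Sample space for two coin tosses: ('H','H'), ('H','T'), ('T','H'), ('T','T').
omega = set(product("HT", repeat=2))

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(omega))

H1 = {w for w in omega if w[0] == "H"}   # heads on toss 1
H2 = {w for w in omega if w[1] == "H"}   # heads on toss 2

lhs = prob(H1 | H2)                          # P(H1 ∪ H2), counted directly
rhs = prob(H1) + prob(H2) - prob(H1 & H2)    # right-hand side of Lemma 1.6
print(lhs, rhs)
```

Both sides come out to 3/4, in agreement with the example.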



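1.8 Theorem (Continuity of Probabilities). If $A_n \to A$ then

$$\mathbb{P}(A_n) \to \mathbb{P}(A)$$

as $n \to \infty$.

Proof. Suppose that $A_n$ is monotone increasing so that $A_1 \subset A_2 \subset \cdots$.
Let $A = \lim_{n \to \infty} A_n = \bigcup_{i=1}^{\infty} A_i$. Define $B_1 = A_1$, $B_2 = \{\omega \in \Omega : \omega \in
A_2, \omega \notin A_1\}$, $B_3 = \{\omega \in \Omega : \omega \in A_3, \omega \notin A_2, \omega \notin A_1\}, \ldots$ It can be
shown that $B_1, B_2, \ldots$ are disjoint, $A_n = \bigcup_{i=1}^{n} A_i = \bigcup_{i=1}^{n} B_i$ for each $n$ and
$\bigcup_{i=1}^{\infty} B_i = \bigcup_{i=1}^{\infty} A_i$. (See exercise 1.) From Axiom 3,

$$\mathbb{P}(A_n) = \mathbb{P}\left(\bigcup_{i=1}^{n} B_i\right) = \sum_{i=1}^{n} \mathbb{P}(B_i)$$

and hence, using Axiom 3 again,

$$\lim_{n \to \infty} \mathbb{P}(A_n) = \lim_{n \to \infty} \sum_{i=1}^{n} \mathbb{P}(B_i) = \sum_{i=1}^{\infty} \mathbb{P}(B_i) = \mathbb{P}\left(\bigcup_{i=1}^{\infty} B_i\right) = \mathbb{P}(A). \qquad ■$$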






1.4 Probability on Finite Sample Spaces 

Suppose that the sample space $\Omega = \{\omega_1, \ldots, \omega_n\}$ is finite. For example, if we
toss a die twice, then $\Omega$ has 36 elements: $\Omega = \{(i, j);\ i, j \in \{1, \ldots, 6\}\}$. If each
outcome is equally likely, then $\mathbb{P}(A) = |A|/36$ where $|A|$ denotes the number
of elements in $A$. The probability that the sum of the dice is 11 is 2/36 since
there are two outcomes that correspond to this event.
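
The counting argument for two dice can be reproduced by listing all 36 outcomes. This is a minimal sketch, assuming equally likely outcomes; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All ordered pairs (i, j) with i, j in {1, ..., 6}.
omega = list(product(range(1, 7), repeat=2))

# Event: the sum of the two dice is 11.
sum_is_11 = [w for w in omega if sum(w) == 11]

# Under equally likely outcomes, P(A) = |A| / |Omega|.
p_sum_11 = Fraction(len(sum_is_11), len(omega))
print(len(omega), sum_is_11, p_sum_11)
```

The two outcomes found are (5, 6) and (6, 5); `Fraction` reduces 2/36 to 1/18 when printing.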

If $\Omega$ is finite and if each outcome is equally likely, then

$$\mathbb{P}(A) = \frac{|A|}{|\Omega|},$$

which is called the uniform probability distribution. To compute prob- 
abilities, we need to count the number of points in an event A. Methods for 
counting points are called combinatorial methods. We needn’t delve into these 
in any great detail. We will, however, need a few facts from counting theory 
that will be useful later. Given n objects, the number of ways of ordering 
these objects is $n! = n(n-1)(n-2) \cdots 3 \cdot 2 \cdot 1$. For convenience, we define
$0! = 1$. We also define

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}, \qquad (1.2)$$

read “$n$ choose $k$”, which is the number of distinct ways of choosing $k$ objects
from $n$. For example, if we have a class of 20 people and we want to select a
committee of 3 students, then there are

$$\binom{20}{3} = \frac{20!}{3!\,17!} = \frac{20 \times 19 \times 18}{3 \times 2 \times 1} = 1140$$

possible committees. We note the following properties:

$$\binom{n}{0} = \binom{n}{n} = 1 \quad \text{and} \quad \binom{n}{k} = \binom{n}{n-k}.$$



1.5 Independent Events 

If we flip a fair coin twice, then the probability of two heads is $\frac{1}{2} \times \frac{1}{2}$. We
multiply the probabilities because we regard the two tosses as independent.
The formal definition of independence is as follows:

1.9 Definition. Two events $A$ and $B$ are independent if

$$\mathbb{P}(AB) = \mathbb{P}(A)\mathbb{P}(B) \qquad (1.3)$$

and we write $A \perp\!\!\!\perp B$. A set of events $\{A_i : i \in I\}$ is independent if

$$\mathbb{P}\left(\bigcap_{i \in J} A_i\right) = \prod_{i \in J} \mathbb{P}(A_i)$$

for every finite subset $J$ of $I$. If $A$ and $B$ are not independent, we write
$A \not\perp\!\!\!\perp B$.



Independence can arise in two distinct ways. Sometimes, we explicitly
assume that two events are independent. For example, in tossing a coin twice,
we usually assume the tosses are independent which reflects the fact that the
coin has no memory of the first toss. In other instances, we derive
independence by verifying that $\mathbb{P}(AB) = \mathbb{P}(A)\mathbb{P}(B)$ holds. For example, in tossing
a fair die, let $A = \{2, 4, 6\}$ and let $B = \{1, 2, 3, 4\}$. Then, $A \cap B = \{2, 4\}$,
$\mathbb{P}(AB) = 2/6 = \mathbb{P}(A)\mathbb{P}(B) = (1/2) \times (2/3)$ and so $A$ and $B$ are independent.
In this case, we didn’t assume that $A$ and $B$ are independent; it just turned
out that they were.
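
Deriving independence amounts to checking the product rule directly. The sketch below does this for the die example with exact fractions; the helper names `prob` and `independent`, and the third event `C`, are hypothetical additions for illustration.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))   # outcomes of one toss of a fair die

def prob(event):
    """Uniform probability of an event on the die's sample space."""
    return Fraction(len(event & omega), len(omega))

def independent(a, b):
    """True when P(AB) = P(A) P(B), the defining product rule."""
    return prob(a & b) == prob(a) * prob(b)

A = frozenset({2, 4, 6})
B = frozenset({1, 2, 3, 4})
C = frozenset({1, 2, 3})         # hypothetical event, not from the text

print(independent(A, B))   # P(AB) = 2/6 and P(A)P(B) = (1/2)(2/3)
print(independent(A, C))   # P(AC) = 1/6 but P(A)P(C) = (1/2)(1/2)
```

Comparing exact fractions avoids the false negatives that floating-point rounding could produce in an equality test.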

Suppose that $A$ and $B$ are disjoint events, each with positive probability.
Can they be independent? No. This follows since $\mathbb{P}(A)\mathbb{P}(B) > 0$ yet $\mathbb{P}(AB) =
\mathbb{P}(\emptyset) = 0$. Except in this special case, there is no way to judge independence
by looking at the sets in a Venn diagram.

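1.10 Example. Toss a fair coin 10 times. Let $A$ = “at least one head.” Let $T_j$
be the event that tails occurs on the $j$th toss. Then

$$\begin{aligned}
\mathbb{P}(A) &= 1 - \mathbb{P}(A^c) \\
&= 1 - \mathbb{P}(\text{all tails}) \\
&= 1 - \mathbb{P}(T_1 T_2 \cdots T_{10}) \\
&= 1 - \mathbb{P}(T_1)\mathbb{P}(T_2) \cdots \mathbb{P}(T_{10}) \quad \text{using independence} \\
&= 1 - \left(\tfrac{1}{2}\right)^{10} \approx .999. \qquad ■
\end{aligned}$$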



1.11 Example. Two people take turns trying to sink a basketball into a net.
Person 1 succeeds with probability 1/3 while person 2 succeeds with
probability 1/4. What is the probability that person 1 succeeds before person 2?
Let $E$ denote the event of interest. Let $A_j$ be the event that the first success
is by person 1 and that it occurs on trial number $j$. Note that $A_1, A_2, \ldots$ are
disjoint and that $E = \bigcup_{j=1}^{\infty} A_j$. Hence,

$$\mathbb{P}(E) = \sum_{j=1}^{\infty} \mathbb{P}(A_j).$$

Now, $\mathbb{P}(A_1) = 1/3$. $A_2$ occurs if we have the sequence person 1 misses, person
2 misses, person 1 succeeds. This has probability $\mathbb{P}(A_2) = (2/3)(3/4)(1/3) =
(1/2)(1/3)$. Following this logic we see that $\mathbb{P}(A_j) = (1/2)^{j-1}(1/3)$. Hence,

$$\mathbb{P}(E) = \sum_{j=1}^{\infty} \frac{1}{3}\left(\frac{1}{2}\right)^{j-1} = \frac{1}{3} \sum_{j=1}^{\infty} \left(\frac{1}{2}\right)^{j-1} = \frac{2}{3}.$$

Here we used the fact that, if $0 < r < 1$ then $\sum_{j=k}^{\infty} r^j = r^k/(1 - r)$. ■
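
The value 2/3 can be checked numerically in two ways: by truncating the geometric series, and by simulating the game. This is only a sketch; the function names, the number of terms, the number of simulated games, and the seed are arbitrary assumptions.

```python
import random
from fractions import Fraction

p1, p2 = Fraction(1, 3), Fraction(1, 4)   # success probabilities from the example

def partial_sum(n_terms):
    """Sum of P(A_j) for j = 1..n_terms, where P(A_j) = ((1-p1)(1-p2))^(j-1) * p1."""
    both_miss = (1 - p1) * (1 - p2)       # equals 1/2 here
    return sum(both_miss ** (j - 1) * p1 for j in range(1, n_terms + 1))

def simulate(n_games, seed=0):
    """Fraction of simulated games in which person 1 scores first."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_games):
        while True:
            if rng.random() < float(p1):  # person 1 shoots first
                wins += 1
                break
            if rng.random() < float(p2):  # then person 2 shoots
                break
    return wins / n_games

print(float(partial_sum(50)), simulate(100_000))
```

The truncated series converges quickly because each term is half the previous one, while the simulated proportion only approximates 2/3 up to sampling error.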
Summary of Independence

1. $A$ and $B$ are independent if and only if $\mathbb{P}(AB) = \mathbb{P}(A)\mathbb{P}(B)$.

2. Independence is sometimes assumed and sometimes derived.

3. Disjoint events with positive probability are not independent.



1.6 Conditional Probability 

Assuming that $\mathbb{P}(B) > 0$, we define the conditional probability of $A$ given
that $B$ has occurred as follows:

1.12 Definition. If $\mathbb{P}(B) > 0$ then the conditional probability of $A$
given $B$ is

$$\mathbb{P}(A|B) = \frac{\mathbb{P}(AB)}{\mathbb{P}(B)}. \qquad (1.4)$$



Think of $\mathbb{P}(A|B)$ as the fraction of times $A$ occurs among those in which
$B$ occurs. For any fixed $B$ such that $\mathbb{P}(B) > 0$, $\mathbb{P}(\cdot|B)$ is a probability (i.e., it
satisfies the three axioms of probability). In particular, $\mathbb{P}(A|B) \geq 0$, $\mathbb{P}(\Omega|B) =
1$ and if $A_1, A_2, \ldots$ are disjoint then $\mathbb{P}\left(\bigcup_{i=1}^{\infty} A_i \,\middle|\, B\right) = \sum_{i=1}^{\infty} \mathbb{P}(A_i|B)$. But it
is in general not true that $\mathbb{P}(A|B \cup C) = \mathbb{P}(A|B) + \mathbb{P}(A|C)$. The rules of
probability apply to events on the left of the bar. In general it is not the case
that $\mathbb{P}(A|B) = \mathbb{P}(B|A)$. People get this confused all the time. For example,
the probability of spots given you have measles is 1 but the probability that
you have measles given that you have spots is not 1. In this case, the difference
between $\mathbb{P}(A|B)$ and $\mathbb{P}(B|A)$ is obvious but there are cases where it is less
obvious. This mistake is made often enough in legal cases that it is sometimes
called the prosecutor’s fallacy.

1.13 Example. A medical test for a disease $D$ has outcomes $+$ and $-$. The
probabilities are:

         $D$      $D^c$
$+$     .009     .099
$-$     .001     .891

From the definition of conditional probability,

$$\mathbb{P}(+|D) = \frac{\mathbb{P}(+ \cap D)}{\mathbb{P}(D)} = \frac{.009}{.009 + .001} = .9$$

and

$$\mathbb{P}(-|D^c) = \frac{\mathbb{P}(- \cap D^c)}{\mathbb{P}(D^c)} = \frac{.891}{.891 + .099} \approx .9.$$

Apparently, the test is fairly accurate. Sick people yield a positive 90 percent
of the time and healthy people yield a negative about 90 percent of the time.
Suppose you go for a test and get a positive. What is the probability you have
the disease? Most people answer .90. The correct answer is

$$\mathbb{P}(D|+) = \frac{\mathbb{P}(+ \cap D)}{\mathbb{P}(+)} = \frac{.009}{.009 + .099} \approx .08.$$

The lesson here is that you need to compute the answer numerically. Don’t
trust your intuition. ■
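
The three conditional probabilities in the medical test example can be computed mechanically from the joint table. The following is a sketch using the four joint probabilities given above; the dictionary layout and helper names are illustrative assumptions.

```python
# Joint probabilities P(test result, disease status) from the table.
joint = {
    ("+", "D"): 0.009, ("+", "Dc"): 0.099,
    ("-", "D"): 0.001, ("-", "Dc"): 0.891,
}

def marginal_test(result):
    """P(test = result), summing the joint table over disease status."""
    return sum(p for (r, _), p in joint.items() if r == result)

def marginal_disease(status):
    """P(disease = status), summing the joint table over test results."""
    return sum(p for (_, s), p in joint.items() if s == status)

p_pos_given_d = joint[("+", "D")] / marginal_disease("D")     # P(+ | D)
p_neg_given_dc = joint[("-", "Dc")] / marginal_disease("Dc")  # P(- | D^c)
p_d_given_pos = joint[("+", "D")] / marginal_test("+")        # P(D | +)

print(round(p_pos_given_d, 3), round(p_neg_given_dc, 3), round(p_d_given_pos, 3))
```

The same numerator, .009, appears in both $\mathbb{P}(+|D)$ and $\mathbb{P}(D|+)$; only the denominator changes, which is why the two answers differ so sharply.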

The results in the next lemma follow directly from the definition of
conditional probability.

1.14 Lemma. If $A$ and $B$ are independent events then $\mathbb{P}(A|B) = \mathbb{P}(A)$. Also,
for any pair of events $A$ and $B$,

$$\mathbb{P}(AB) = \mathbb{P}(A|B)\mathbb{P}(B) = \mathbb{P}(B|A)\mathbb{P}(A).$$

From the last lemma, we see that another interpretation of independence is
that knowing $B$ doesn’t change the probability of $A$. The formula $\mathbb{P}(AB) =
\mathbb{P}(A)\mathbb{P}(B|A)$ is sometimes helpful for calculating probabilities.

1.15 Example. Draw two cards from a deck, without replacement. Let $A$ be
the event that the first draw is the Ace of Clubs and let $B$ be the event that
the second draw is the Queen of Diamonds. Then $\mathbb{P}(AB) = \mathbb{P}(A)\mathbb{P}(B|A) =
(1/52) \times (1/51)$. ■



Summary of Conditional Probability

1. If $\mathbb{P}(B) > 0$, then

$$\mathbb{P}(A|B) = \frac{\mathbb{P}(AB)}{\mathbb{P}(B)}.$$

2. $\mathbb{P}(\cdot|B)$ satisfies the axioms of probability, for fixed $B$. In general,
$\mathbb{P}(A|\cdot)$ does not satisfy the axioms of probability, for fixed $A$.

3. In general, $\mathbb{P}(A|B) \neq \mathbb{P}(B|A)$.

4. $A$ and $B$ are independent if and only if $\mathbb{P}(A|B) = \mathbb{P}(A)$.



1.7 Bayes’ Theorem 

Bayes’ theorem is the basis of “expert systems” and “Bayes’ nets,” which are 
discussed in Chapter 17. First, we need a preliminary result. 

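1.16 Theorem (The Law of Total Probability). Let $A_1, \ldots, A_k$ be a partition
of $\Omega$. Then, for any event $B$,

$$\mathbb{P}(B) = \sum_{i=1}^{k} \mathbb{P}(B|A_i)\mathbb{P}(A_i).$$

Proof. Define $C_j = BA_j$ and note that $C_1, \ldots, C_k$ are disjoint and that
$B = \bigcup_{j=1}^{k} C_j$. Hence,

$$\mathbb{P}(B) = \sum_{j} \mathbb{P}(C_j) = \sum_{j} \mathbb{P}(BA_j) = \sum_{j} \mathbb{P}(B|A_j)\mathbb{P}(A_j)$$

since $\mathbb{P}(BA_j) = \mathbb{P}(B|A_j)\mathbb{P}(A_j)$ from the definition of conditional probability. ■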

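1.17 Theorem (Bayes’ Theorem). Let $A_1, \ldots, A_k$ be a partition of $\Omega$ such
that $\mathbb{P}(A_i) > 0$ for each $i$. If $\mathbb{P}(B) > 0$ then, for each $i = 1, \ldots, k$,

$$\mathbb{P}(A_i|B) = \frac{\mathbb{P}(B|A_i)\mathbb{P}(A_i)}{\sum_{j} \mathbb{P}(B|A_j)\mathbb{P}(A_j)}. \qquad (1.5)$$

1.18 Remark. We call $\mathbb{P}(A_i)$ the prior probability of $A$ and $\mathbb{P}(A_i|B)$ the
posterior probability of $A$.

Proof. We apply the definition of conditional probability twice, followed
by the law of total probability:

$$\mathbb{P}(A_i|B) = \frac{\mathbb{P}(A_i B)}{\mathbb{P}(B)} = \frac{\mathbb{P}(B|A_i)\mathbb{P}(A_i)}{\mathbb{P}(B)} = \frac{\mathbb{P}(B|A_i)\mathbb{P}(A_i)}{\sum_{j} \mathbb{P}(B|A_j)\mathbb{P}(A_j)}. \qquad ■$$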

1.19 Example. I divide my email into three categories: $A_1$ = “spam,” $A_2$ =
“low priority” and $A_3$ = “high priority.” From previous experience I find that
$\mathbb{P}(A_1) = .7$, $\mathbb{P}(A_2) = .2$ and $\mathbb{P}(A_3) = .1$. Of course, $.7 + .2 + .1 = 1$. Let $B$ be
the event that the email contains the word “free.” From previous experience,
$\mathbb{P}(B|A_1) = .9$, $\mathbb{P}(B|A_2) = .01$, $\mathbb{P}(B|A_3) = .01$. (Note: $.9 + .01 + .01 \neq 1$.) I
receive an email with the word “free.” What is the probability that it is spam?
Bayes’ theorem yields,

$$\mathbb{P}(A_1|B) = \frac{.9 \times .7}{(.9 \times .7) + (.01 \times .2) + (.01 \times .1)} = .995. \qquad ■$$
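
Bayes’ theorem is easy to mechanize: multiply each prior by its likelihood, then divide by the total, which is the law of total probability. The following is a minimal sketch applied to the email example; the function name `posterior` and the dictionary keys are illustrative assumptions.

```python
def posterior(priors, likelihoods):
    """Return P(A_i | B) for every A_i in a partition, via Bayes' theorem.

    priors[k] is P(A_k); likelihoods[k] is P(B | A_k).
    """
    joint = {k: priors[k] * likelihoods[k] for k in priors}   # P(B | A_k) P(A_k)
    total = sum(joint.values())                               # P(B), by total probability
    return {k: v / total for k, v in joint.items()}

priors = {"spam": 0.7, "low priority": 0.2, "high priority": 0.1}
likelihoods = {"spam": 0.9, "low priority": 0.01, "high priority": 0.01}

post = posterior(priors, likelihoods)
print({k: round(v, 3) for k, v in post.items()})
```

The posterior probabilities sum to 1 even though the likelihoods do not, because the division by $\mathbb{P}(B)$ renormalizes them.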

1.8 Bibliographic Remarks 

The material in this chapter is standard. Details can be found in any number 
of books. At the introductory level, there is DeGroot and Schervish (2002); 
at the intermediate level, Grimmett and Stirzaker (1982) and Karr (1993); at 
the advanced level there are Billingsley (1979) and Breiman (1992). I adapted 
many examples and exercises from DeGroot and Schervish (2002) and Grim- 
mett and Stirzaker (1982). 



1.9 Appendix 

Generally, it is not feasible to assign probabilities to all subsets of a sample
space $\Omega$. Instead, one restricts attention to a set of events called a $\sigma$-algebra
or a $\sigma$-field which is a class $\mathcal{A}$ that satisfies:

(i) $\emptyset \in \mathcal{A}$,

(ii) if $A_1, A_2, \ldots \in \mathcal{A}$ then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$ and

(iii) $A \in \mathcal{A}$ implies that $A^c \in \mathcal{A}$.

The sets in $\mathcal{A}$ are said to be measurable. We call $(\Omega, \mathcal{A})$ a measurable
space. If $\mathbb{P}$ is a probability measure defined on $\mathcal{A}$, then $(\Omega, \mathcal{A}, \mathbb{P})$ is called a
probability space. When $\Omega$ is the real line, we take $\mathcal{A}$ to be the smallest
$\sigma$-field that contains all the open subsets, which is called the Borel $\sigma$-field.



1.10 Exercises 

1. Fill in the details of the proof of Theorem 1.8. Also, prove the monotone 
decreasing case. 

2. Prove the statements in equation (1.1). 
3. Let $\Omega$ be a sample space and let $A_1, A_2, \ldots$ be events. Define $B_n =
\bigcup_{i=n}^{\infty} A_i$ and $C_n = \bigcap_{i=n}^{\infty} A_i$.

(a) Show that $B_1 \supset B_2 \supset \cdots$ and that $C_1 \subset C_2 \subset \cdots$.

(b) Show that $\omega \in \bigcap_{n=1}^{\infty} B_n$ if and only if $\omega$ belongs to an infinite
number of the events $A_1, A_2, \ldots$.

(c) Show that $\omega \in \bigcup_{n=1}^{\infty} C_n$ if and only if $\omega$ belongs to all the events
$A_1, A_2, \ldots$ except possibly a finite number of those events.

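4. Let $\{A_i : i \in I\}$ be a collection of events where $I$ is an arbitrary index
set. Show that

$$\left(\bigcup_{i \in I} A_i\right)^c = \bigcap_{i \in I} A_i^c \quad \text{and} \quad \left(\bigcap_{i \in I} A_i\right)^c = \bigcup_{i \in I} A_i^c.$$

Hint: First prove this for $I = \{1, \ldots, n\}$.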

5. Suppose we toss a fair coin until we get exactly two heads. Describe 
the sample space S. What is the probability that exactly k tosses are 
required? 

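6. Let $\Omega = \{0, 1, \ldots\}$. Prove that there does not exist a uniform
distribution on $\Omega$ (i.e., if $\mathbb{P}(A) = \mathbb{P}(B)$ whenever $|A| = |B|$, then $\mathbb{P}$ cannot
satisfy the axioms of probability).

7. Let $A_1, A_2, \ldots$ be events. Show that

$$\mathbb{P}\left(\bigcup_{n=1}^{\infty} A_n\right) \leq \sum_{n=1}^{\infty} \mathbb{P}(A_n).$$

Hint: Define $B_n = A_n - \bigcup_{i=1}^{n-1} A_i$. Then show that the $B_n$ are disjoint
and that $\bigcup_{n=1}^{\infty} A_n = \bigcup_{n=1}^{\infty} B_n$.

8. Suppose that $\mathbb{P}(A_i) = 1$ for each $i$. Prove that

$$\mathbb{P}\left(\bigcap_{i=1}^{\infty} A_i\right) = 1.$$

9. For fixed $B$ such that $\mathbb{P}(B) > 0$, show that $\mathbb{P}(\cdot|B)$ satisfies the axioms
of probability.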

10. You have probably heard it before. Now you can solve it rigorously. 
It is called the “Monty Hall Problem.” A prize is placed at random 
behind one of three doors. You pick a door. To be concrete, let’s suppose
you always pick door 1. Now Monty Hall chooses one of the other two
doors, opens it and shows you that it is empty. He then gives you the
opportunity to keep your door or switch to the other unopened door.
Should you stay or switch? Intuition suggests it doesn’t matter. The
correct answer is that you should switch. Prove it. It will help to specify
the sample space and the relevant events carefully. Thus write $\Omega =
\{(\omega_1, \omega_2) : \omega_i \in \{1, 2, 3\}\}$ where $\omega_1$ is where the prize is and $\omega_2$ is the
door Monty opens.

11. Suppose that $A$ and $B$ are independent events. Show that $A^c$ and $B^c$
are independent events.

12. There are three cards. The first is green on both sides, the second is red 
on both sides and the third is green on one side and red on the other. We 
choose a card at random and we see one side (also chosen at random). 
If the side we see is green, what is the probability that the other side is 
also green? Many people intuitively answer 1/2. Show that the correct 
answer is 2/3. 

13. Suppose that a fair coin is tossed repeatedly until both a head and tail 
have appeared at least once. 

(a) Describe the sample space Q. 

(b) What is the probability that three tosses will be required? 

14. Show that if $\mathbb{P}(A) = 0$ or $\mathbb{P}(A) = 1$ then $A$ is independent of every other
event. Show that if $A$ is independent of itself then $\mathbb{P}(A)$ is either 0 or 1.

15. The probability that a child has blue eyes is 1/4. Assume independence 
between children. Consider a family with 3 children. 

(a) If it is known that at least one child has blue eyes, what is the 
probability that at least two children have blue eyes? 

(b) If it is known that the youngest child has blue eyes, what is the 
probability that at least two children have blue eyes? 

16. Prove Lemma 1.14. 

17. Show that

$$\mathbb{P}(ABC) = \mathbb{P}(A|BC)\mathbb{P}(B|C)\mathbb{P}(C).$$

18. Suppose $k$ events form a partition of the sample space $\Omega$, i.e., they
are disjoint and $\bigcup_{i=1}^{k} A_i = \Omega$. Assume that $\mathbb{P}(B) > 0$. Prove that if
$\mathbb{P}(A_1|B) < \mathbb{P}(A_1)$ then $\mathbb{P}(A_i|B) > \mathbb{P}(A_i)$ for some $i = 2, \ldots, k$.

19. Suppose that 30 percent of computer owners use a Macintosh, 50 percent 
use Windows, and 20 percent use Linux. Suppose that 65 percent of 
the Mac users have succumbed to a computer virus, 82 percent of the 
Windows users get the virus, and 50 percent of the Linux users get 
the virus. We select a person at random and learn that her system was 
infected with the virus. What is the probability that she is a Windows 
user? 

20. A box contains 5 coins and each has a different probability of showing
heads. Let $p_1, \ldots, p_5$ denote the probability of heads on each coin.
Suppose that

$p_1 = 0$, $p_2 = 1/4$, $p_3 = 1/2$, $p_4 = 3/4$ and $p_5 = 1$.

Let $H$ denote “heads is obtained” and let $C_i$ denote the event that coin
$i$ is selected.

(a) Select a coin at random and toss it. Suppose a head is obtained.
What is the posterior probability that coin $i$ was selected ($i = 1, \ldots, 5$)?
In other words, find $\mathbb{P}(C_i|H)$ for $i = 1, \ldots, 5$.

(b) Toss the coin again. What is the probability of another head? In
other words find $\mathbb{P}(H_2|H_1)$ where $H_j$ = “heads on toss $j$.”

Now suppose that the experiment was carried out as follows: We select
a coin at random and toss it until a head is obtained.

(c) Find $\mathbb{P}(C_i|B_4)$ where $B_4$ = “first head is obtained on toss 4.”

21. (Computer Experiment.) Suppose a coin has probability p of falling heads 
up. If we flip the coin many times, we would expect the proportion of 
heads to be near p. We will make this formal later. Take p = .3 and 
n = 1,000 and simulate n coin flips. Plot the proportion of heads as a 
function of n. Repeat for p = .03. 

22. (Computer Experiment.) Suppose we flip a coin n times and let p denote 
the probability of heads. Let X be the number of heads. We call X 
a binomial random variable, which is discussed in the next chapter. 
Intuition suggests that X will be close to np. To see if this is true, we 
can repeat this experiment many times and average the X values. Carry 
out a simulation and compare the average of the $X$’s to $np$. Try this for
$p = .3$ and $n = 10$, $n = 100$, and $n = 1{,}000$.

23. (Computer Experiment.) Here we will get some experience simulating
conditional probabilities. Consider tossing a fair die. Let $A = \{2, 4, 6\}$
and $B = \{1, 2, 3, 4\}$. Then, $\mathbb{P}(A) = 1/2$, $\mathbb{P}(B) = 2/3$ and $\mathbb{P}(AB) = 1/3$.
Since $\mathbb{P}(AB) = \mathbb{P}(A)\mathbb{P}(B)$, the events $A$ and $B$ are independent. Simulate
draws from the sample space and verify that $\hat{\mathbb{P}}(AB) = \hat{\mathbb{P}}(A)\hat{\mathbb{P}}(B)$
where $\hat{\mathbb{P}}(A)$ is the proportion of times $A$ occurred in the simulation and
similarly for $\hat{\mathbb{P}}(AB)$ and $\hat{\mathbb{P}}(B)$. Now find two events $A$ and $B$ that are not
independent. Compute $\hat{\mathbb{P}}(A)$, $\hat{\mathbb{P}}(B)$ and $\hat{\mathbb{P}}(AB)$. Compare the calculated
values to their theoretical values. Report your results and interpret.




2 

Random Variables 



2.1 Introduction 

Statistics and data mining are concerned with data. How do we link sample 
spaces and events to data? The link is provided by the concept of a random 
variable. 



2.1 Definition. A random variable is a mapping¹

$$X : \Omega \to \mathbb{R}$$

that assigns a real number $X(\omega)$ to each outcome $\omega$.



At a certain point in most probability courses, the sample space is rarely 
mentioned anymore and we work directly with random variables. But you 
should keep in mind that the sample space is really there, lurking in the 
background. 

2.2 Example. Flip a coin ten times. Let $X(\omega)$ be the number of heads in the
sequence $\omega$. For example, if $\omega = HHTHHTHHTT$, then $X(\omega) = 6$. ■



¹Technically, a random variable must be measurable. See the appendix for details.

2.3 Example. Let $\Omega = \{(x, y) : x^2 + y^2 \le 1\}$ be the unit disk. Consider
drawing a point at random from $\Omega$. (We will make this idea more precise
later.) A typical outcome is of the form $\omega = (x, y)$. Some examples of random
variables are $X(\omega) = x$, $Y(\omega) = y$, $Z(\omega) = x + y$, and $W(\omega) = \sqrt{x^2 + y^2}$. ■

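Given a random variable $X$ and a subset $A$ of the real line, define
$X^{-1}(A) = \{\omega \in \Omega : X(\omega) \in A\}$ and let

$$\mathbb{P}(X \in A) = \mathbb{P}(X^{-1}(A)) = \mathbb{P}(\{\omega \in \Omega : X(\omega) \in A\})$$

$$\mathbb{P}(X = x) = \mathbb{P}(X^{-1}(x)) = \mathbb{P}(\{\omega \in \Omega : X(\omega) = x\}).$$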

Notice that X denotes the random variable and x denotes a particular value 
of X. 



2.4 Example. Flip a coin twice and let $X$ be the number of heads. Then,
$\mathbb{P}(X = 0) = \mathbb{P}(\{TT\}) = 1/4$, $\mathbb{P}(X = 1) = \mathbb{P}(\{HT, TH\}) = 1/2$ and
$\mathbb{P}(X = 2) = \mathbb{P}(\{HH\}) = 1/4$. The random variable and its distribution
can be summarized as follows:

    ω      P({ω})    X(ω)
    TT     1/4       0
    TH     1/4       1
    HT     1/4       1
    HH     1/4       2

    x      P(X = x)
    0      1/4
    1      1/2
    2      1/4

Try generalizing this to n flips. ■
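As a hedged illustration of the generalization to n flips, the sketch below
enumerates the sample space explicitly and applies the map X(ω) = number of
heads to every outcome. It assumes a fair coin, as in Example 2.4; the function
name `heads_distribution` is made up for this illustration.

```python
from fractions import Fraction
from itertools import product

def heads_distribution(n):
    """Return {x: P(X = x)} where X counts heads in n fair coin flips."""
    dist = {}
    weight = Fraction(1, 2 ** n)            # each outcome omega has probability 2^-n
    for omega in product("HT", repeat=n):   # enumerate the whole sample space
        x = omega.count("H")                # the random variable X(omega)
        dist[x] = dist.get(x, Fraction(0)) + weight
    return dist

if __name__ == "__main__":
    for n in (2, 3):
        print(n, sorted(heads_distribution(n).items()))
```

Exact fractions are used so that the n = 2 case reproduces the table above
without rounding. Enumeration costs 2^n steps, so this is only practical for
small n.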



2.2 Distribution Functions and Probability Functions 

Given a random variable X, we define the cumulative distribution function 
(or distribution function) as follows. 



2.5 Definition. The cumulative distribution function, or CDF, is the
function $F_X : \mathbb{R} \to [0, 1]$ defined by

$$F_X(x) = \mathbb{P}(X \le x). \qquad (2.1)$$




FIGURE 2.1. CDF for flipping a coin twice (Example 2.6).



We will see later that the CDF effectively contains all the information about
the random variable. Sometimes we write the CDF as $F$ instead of $F_X$.



2.6 Example. Flip a fair coin twice and let $X$ be the number of heads. Then
$\mathbb{P}(X = 0) = \mathbb{P}(X = 2) = 1/4$ and $\mathbb{P}(X = 1) = 1/2$. The distribution function
is

$$F_X(x) = \begin{cases} 0 & x < 0 \\ 1/4 & 0 \le x < 1 \\ 3/4 & 1 \le x < 2 \\ 1 & x \ge 2. \end{cases}$$

The CDF is shown in Figure 2.1. Although this example is simple, study it
carefully. CDFs can be very confusing. Notice that the function is right
continuous, non-decreasing, and that it is defined for all $x$, even though the random
variable only takes values 0, 1, and 2. Do you see why $F_X(1.4) = .75$? ■
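To make the step shape of this CDF concrete, here is a minimal sketch that
evaluates F_X(x) = P(X ≤ x) directly from the mass function of Example 2.6.
The helper name `cdf` and the dictionary representation of the mass function
are choices made for this illustration only.

```python
def cdf(pmf, x):
    """F_X(x) = P(X <= x) for a discrete pmf given as {value: probability}."""
    return sum(p for value, p in pmf.items() if value <= x)

# Mass function of the number of heads in two fair flips (Example 2.6).
coin_pmf = {0: 0.25, 1: 0.5, 2: 0.25}

if __name__ == "__main__":
    for x in (-1, 0, 0.5, 1, 1.4, 2, 3):
        print(x, cdf(coin_pmf, x))
```

Evaluating at x = 1.4 adds the mass at 0 and at 1, which is why F_X(1.4) = .75
even though X never takes the value 1.4.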



The following result shows that the CDF completely determines the distri- 
bution of a random variable. 



2.7 Theorem. Let $X$ have CDF $F$ and let $Y$ have CDF $G$. If $F(x) = G(x)$ for
all $x$, then $\mathbb{P}(X \in A) = \mathbb{P}(Y \in A)$ for all $A$.²

²Technically, we only have that $\mathbb{P}(X \in A) = \mathbb{P}(Y \in A)$ for every measurable event $A$.

2.8 Theorem. A function $F$ mapping the real line to $[0, 1]$ is a CDF for some
probability $\mathbb{P}$ if and only if $F$ satisfies the following three conditions:

(i) $F$ is non-decreasing: $x_1 < x_2$ implies that $F(x_1) \le F(x_2)$.

(ii) $F$ is normalized:

$$\lim_{x \to -\infty} F(x) = 0 \quad\text{and}\quad \lim_{x \to \infty} F(x) = 1.$$

(iii) $F$ is right-continuous: $F(x) = F(x^+)$ for all $x$, where

$$F(x^+) = \lim_{y \to x,\; y > x} F(y).$$

Proof. Suppose that $F$ is a CDF. Let us show that (iii) holds. Let $x$ be
a real number and let $y_1, y_2, \ldots$ be a sequence of real numbers such that
$y_1 > y_2 > \cdots$ and $\lim_i y_i = x$. Let $A_i = (-\infty, y_i]$ and let $A = (-\infty, x]$. Note
that $A = \bigcap_{i=1}^{\infty} A_i$ and also note that $A_1 \supset A_2 \supset \cdots$. Because the events are
monotone, $\lim_i \mathbb{P}(A_i) = \mathbb{P}(\bigcap_i A_i)$. Thus,

$$F(x) = \mathbb{P}(A) = \mathbb{P}\Big(\bigcap_i A_i\Big) = \lim_i \mathbb{P}(A_i) = \lim_i F(y_i) = F(x^+).$$

Showing (i) and (ii) is similar. Proving the other direction, namely that if
$F$ satisfies (i), (ii), and (iii) then it is a CDF for some random variable, uses
some deep tools in analysis. ■

2.9 Definition. $X$ is discrete if it takes countably³ many values
$\{x_1, x_2, \ldots\}$. We define the probability function or probability mass
function for $X$ by $f_X(x) = \mathbb{P}(X = x)$.

Thus, $f_X(x) \ge 0$ for all $x \in \mathbb{R}$ and $\sum_i f_X(x_i) = 1$. Sometimes we write $f$
instead of $f_X$. The CDF of $X$ is related to $f_X$ by

$$F_X(x) = \mathbb{P}(X \le x) = \sum_{x_i \le x} f_X(x_i).$$

2.10 Example. The probability function for Example 2.6 is

$$f_X(x) = \begin{cases} 1/4 & x = 0 \\ 1/2 & x = 1 \\ 1/4 & x = 2 \\ 0 & \text{otherwise.} \end{cases}$$

See Figure 2.2. ■

³A set is countable if it is finite or it can be put in a one-to-one correspondence with the
integers. The even numbers, the odd numbers, and the rationals are countable; the set of real
numbers between 0 and 1 is not countable.






FIGURE 2.2. Probability function for flipping a coin twice (Example 2.6).



2.11 Definition. A random variable $X$ is continuous if there exists a
function $f_X$ such that $f_X(x) \ge 0$ for all $x$, $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$ and for
every $a \le b$,

$$\mathbb{P}(a < X < b) = \int_a^b f_X(x)\,dx. \qquad (2.2)$$

The function $f_X$ is called the probability density function (PDF). We
have that

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$$

and $f_X(x) = F_X'(x)$ at all points $x$ at which $F_X$ is differentiable.

Sometimes we write $\int f(x)\,dx$ or $\int f$ to mean $\int_{-\infty}^{\infty} f(x)\,dx$.



2.12 Example. Suppose that $X$ has PDF

$$f_X(x) = \begin{cases} 1 & \text{for } 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Clearly, $f_X(x) \ge 0$ and $\int f_X(x)\,dx = 1$. A random variable with this density
is said to have a Uniform(0,1) distribution. This is meant to capture the idea
of choosing a point at random between 0 and 1. The CDF is given by

$$F_X(x) = \begin{cases} 0 & x < 0 \\ x & 0 \le x \le 1 \\ 1 & x > 1. \end{cases}$$

See Figure 2.3.

FIGURE 2.3. CDF for Uniform(0,1).



2.13 Example. Suppose that $X$ has PDF

$$f(x) = \begin{cases} 0 & \text{for } x < 0 \\ \dfrac{1}{(1 + x)^2} & \text{otherwise.} \end{cases}$$

Since $\int f(x)\,dx = 1$, this is a well-defined PDF. ■



Warning! Continuous random variables can lead to confusion. First, note
that if $X$ is continuous then $\mathbb{P}(X = x) = 0$ for every $x$. Don’t try to think
of $f(x)$ as $\mathbb{P}(X = x)$. This only holds for discrete random variables. We get
probabilities from a PDF by integrating. A PDF can be bigger than 1 (unlike
a mass function). For example, if $f(x) = 5$ for $x \in [0, 1/5]$ and 0 otherwise,
then $f(x) \ge 0$ and $\int f(x)\,dx = 1$ so this is a well-defined PDF even though
$f(x) = 5$ in some places. In fact, a PDF can be unbounded. For example, if
$f(x) = (2/3)x^{-1/3}$ for $0 < x < 1$ and $f(x) = 0$ otherwise, then $\int f(x)\,dx = 1$
even though $f$ is not bounded.



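2.14 Example. Let

$$f(x) = \begin{cases} 0 & \text{for } x < 0 \\ \dfrac{1}{1 + x} & \text{otherwise.} \end{cases}$$

This is not a PDF since $\int f(x)\,dx = \int_0^{\infty} dx/(1 + x) = \int_1^{\infty} du/u = \log(\infty) = \infty$. ■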
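The contrast between Examples 2.13 and 2.14 can be seen numerically. The
sketch below, which is only an illustration, integrates both functions over
[0, T] with a plain midpoint rule (a technique chosen here, not one from the
text) and lets T grow. The names `midpoint_integral`, `f_213` and `f_214` are
hypothetical.

```python
import math

def midpoint_integral(f, a, b, n=200_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def f_213(x):
    return 1.0 / (1.0 + x) ** 2   # Example 2.13: integrates to 1

def f_214(x):
    return 1.0 / (1.0 + x)        # Example 2.14: integral diverges

if __name__ == "__main__":
    for upper in (10, 100, 1000):
        print(upper,
              round(midpoint_integral(f_213, 0, upper), 4),
              round(midpoint_integral(f_214, 0, upper), 4))
```

The first column of results approaches 1, matching 1 − 1/(1 + T), while the
second keeps growing like log(1 + T), so no constant can normalize it.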



2.15 Lemma. Let $F$ be the CDF for a random variable $X$. Then:

1. $\mathbb{P}(X = x) = F(x) - F(x^-)$ where $F(x^-) = \lim_{y \uparrow x} F(y)$;

2. $\mathbb{P}(x < X \le y) = F(y) - F(x)$;

3. $\mathbb{P}(X > x) = 1 - F(x)$;

4. If $X$ is continuous then

$$F(b) - F(a) = \mathbb{P}(a < X < b) = \mathbb{P}(a \le X < b) = \mathbb{P}(a < X \le b) = \mathbb{P}(a \le X \le b).$$



It is also useful to define the inverse CDF (or quantile function). 



2.16 Definition. Let $X$ be a random variable with CDF $F$. The inverse
CDF or quantile function is defined by⁴

$$F^{-1}(q) = \inf\{x : F(x) > q\}$$

for $q \in [0, 1]$. If $F$ is strictly increasing and continuous then $F^{-1}(q)$ is the
unique real number $x$ such that $F(x) = q$.

We call $F^{-1}(1/4)$ the first quartile, $F^{-1}(1/2)$ the median (or second
quartile), and $F^{-1}(3/4)$ the third quartile.

Two random variables $X$ and $Y$ are equal in distribution, written
$X \overset{d}{=} Y$, if $F_X(x) = F_Y(x)$ for all $x$. This does not mean that $X$ and $Y$ are
equal. Rather, it means that all probability statements about $X$ and $Y$ will
be the same. For example, suppose that $\mathbb{P}(X = 1) = \mathbb{P}(X = -1) = 1/2$. Let
$Y = -X$. Then $\mathbb{P}(Y = 1) = \mathbb{P}(Y = -1) = 1/2$ and so $X \overset{d}{=} Y$. But $X$ and $Y$
are not equal. In fact, $\mathbb{P}(X = Y) = 0$.



2.3 Some Important Discrete Random Variables 

Warning About Notation! It is traditional to write $X \sim F$ to indicate
that $X$ has distribution $F$. This is unfortunate notation since the symbol $\sim$
is also used to denote an approximation. The notation $X \sim F$ is so pervasive
that we are stuck with it. Read $X \sim F$ as “X has distribution F” not as “X
is approximately F”.

⁴If you are unfamiliar with “inf”, just think of it as the minimum.

The Point Mass Distribution. $X$ has a point mass distribution at $a$,
written $X \sim \delta_a$, if $\mathbb{P}(X = a) = 1$ in which case

$$F(x) = \begin{cases} 0 & x < a \\ 1 & x \ge a. \end{cases}$$

The probability mass function is $f(x) = 1$ for $x = a$ and 0 otherwise.



The Discrete Uniform Distribution. Let $k > 1$ be a given integer.
Suppose that $X$ has probability mass function given by

$$f(x) = \begin{cases} 1/k & \text{for } x = 1, \ldots, k \\ 0 & \text{otherwise.} \end{cases}$$

We say that $X$ has a uniform distribution on $\{1, \ldots, k\}$.

The Bernoulli Distribution. Let $X$ represent a binary coin flip. Then
$\mathbb{P}(X = 1) = p$ and $\mathbb{P}(X = 0) = 1 - p$ for some $p \in [0, 1]$. We say that $X$ has a
Bernoulli distribution written $X \sim \text{Bernoulli}(p)$. The probability function is
$f(x) = p^x (1 - p)^{1 - x}$ for $x \in \{0, 1\}$.



The Binomial Distribution. Suppose we have a coin which falls heads
up with probability $p$ for some $0 \le p \le 1$. Flip the coin $n$ times and let
$X$ be the number of heads. Assume that the tosses are independent. Let
$f(x) = \mathbb{P}(X = x)$ be the mass function. It can be shown that

$$f(x) = \begin{cases} \binom{n}{x} p^x (1 - p)^{n - x} & \text{for } x = 0, \ldots, n \\ 0 & \text{otherwise.} \end{cases}$$

A random variable with this mass function is called a Binomial random
variable and we write $X \sim \text{Binomial}(n, p)$. If $X_1 \sim \text{Binomial}(n_1, p)$ and
$X_2 \sim \text{Binomial}(n_2, p)$ are independent, then $X_1 + X_2 \sim \text{Binomial}(n_1 + n_2, p)$.
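The additivity property of the Binomial can be checked numerically. The
following sketch computes the mass function with `math.comb` and compares the
mass function of a sum of two independent Binomials (obtained by discrete
convolution) with the Binomial(n_1 + n_2, p) mass function. The function names
and the values n_1 = 4, n_2 = 6, p = 0.3 are arbitrary choices for
illustration.

```python
from math import comb

def binomial_pmf(n, p):
    """List of f(x) = C(n, x) p^x (1 - p)^(n - x) for x = 0, ..., n."""
    return [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]

def convolve(f, g):
    """pmf of X1 + X2 for independent X1, X2 with pmfs f, g on 0, 1, 2, ..."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj
    return out

p = 0.3
lhs = convolve(binomial_pmf(4, p), binomial_pmf(6, p))  # pmf of X1 + X2
rhs = binomial_pmf(10, p)                               # Binomial(10, p)

if __name__ == "__main__":
    print(max(abs(a - b) for a, b in zip(lhs, rhs)))
```

The convolution step is exactly where independence of the two variables is
used: the joint mass function factors into the product of the two marginals.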



Warning! Let us take this opportunity to prevent some confusion. $X$ is a
random variable; $x$ denotes a particular value of the random variable; $n$ and $p$
are parameters, that is, fixed real numbers. The parameter $p$ is usually
unknown and must be estimated from data; that’s what statistical inference is all
about. In most statistical models, there are random variables and parameters:
don’t confuse them.



The Geometric Distribution. $X$ has a geometric distribution with
parameter $p \in (0, 1)$, written $X \sim \text{Geom}(p)$, if

$$\mathbb{P}(X = k) = p(1 - p)^{k - 1}, \quad k \ge 1.$$

We have that

$$\sum_{k=1}^{\infty} \mathbb{P}(X = k) = p \sum_{k=1}^{\infty} (1 - p)^{k - 1} = \frac{p}{1 - (1 - p)} = 1.$$

Think of $X$ as the number of flips needed until the first head when flipping a
coin.



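The Poisson Distribution. $X$ has a Poisson distribution with parameter
$\lambda$, written $X \sim \text{Poisson}(\lambda)$, if

$$f(x) = e^{-\lambda} \frac{\lambda^x}{x!}, \quad x \ge 0.$$

Note that

$$\sum_{x=0}^{\infty} f(x) = e^{-\lambda} \sum_{x=0}^{\infty} \frac{\lambda^x}{x!} = e^{-\lambda} e^{\lambda} = 1.$$

The Poisson is often used as a model for counts of rare events like radioactive
decay and traffic accidents. If $X_1 \sim \text{Poisson}(\lambda_1)$ and $X_2 \sim \text{Poisson}(\lambda_2)$ are
independent, then $X_1 + X_2 \sim \text{Poisson}(\lambda_1 + \lambda_2)$.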






Warning! We defined random variables to be mappings from a sample
space $\Omega$ to $\mathbb{R}$ but we did not mention the sample space in any of the
distributions above. As I mentioned earlier, the sample space often “disappears”
but it is really there in the background. Let’s construct a sample space
explicitly for a Bernoulli random variable. Let $\Omega = [0, 1]$ and define $\mathbb{P}$ to satisfy
$\mathbb{P}([a, b]) = b - a$ for $0 \le a \le b \le 1$. Fix $p \in [0, 1]$ and define

$$X(\omega) = \begin{cases} 1 & \omega \le p \\ 0 & \omega > p. \end{cases}$$

Then $\mathbb{P}(X = 1) = \mathbb{P}(\omega \le p) = \mathbb{P}([0, p]) = p$ and $\mathbb{P}(X = 0) = 1 - p$. Thus,
$X \sim \text{Bernoulli}(p)$. We could do this for all the distributions defined above. In
practice, we think of a random variable like a random number but formally it
is a mapping defined on some sample space.

2.4 Some Important Continuous Random Variables

The Uniform Distribution. $X$ has a Uniform($a$, $b$) distribution, written
$X \sim \text{Uniform}(a, b)$, if

$$f(x) = \begin{cases} \dfrac{1}{b - a} & \text{for } x \in [a, b] \\ 0 & \text{otherwise} \end{cases}$$

where $a < b$. The distribution function is

$$F(x) = \begin{cases} 0 & x < a \\ \dfrac{x - a}{b - a} & x \in [a, b] \\ 1 & x > b. \end{cases}$$
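The explicit sample-space construction of a Bernoulli variable described
earlier (take Ω = [0, 1] with P([a, b]) = b − a, and set X(ω) = 1 when ω ≤ p)
can be imitated in code, since drawing ω this way is the same as drawing from
Uniform(0, 1). This is only a sketch: the helper name `bernoulli_from_omega`,
the seed, and the sample size are assumptions made for the illustration.

```python
import random

def bernoulli_from_omega(omega, p):
    """The mapping X(omega) = 1 if omega <= p, else 0, for omega in [0, 1]."""
    return 1 if omega <= p else 0

random.seed(0)
p = 0.3
# random.random() plays the role of drawing an outcome omega from [0, 1).
draws = [bernoulli_from_omega(random.random(), p) for _ in range(100_000)]
proportion_of_ones = sum(draws) / len(draws)

if __name__ == "__main__":
    print(proportion_of_ones)
```

The observed proportion of ones should be close to p, reflecting
P(X = 1) = P([0, p]) = p.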



Normal (Gaussian). $X$ has a Normal (or Gaussian) distribution with
parameters $\mu$ and $\sigma$, denoted by $X \sim N(\mu, \sigma^2)$, if

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{1}{2\sigma^2}(x - \mu)^2 \right\}, \quad x \in \mathbb{R} \qquad (2.3)$$

where $\mu \in \mathbb{R}$ and $\sigma > 0$. The parameter $\mu$ is the “center” (or mean) of the
distribution and $\sigma$ is the “spread” (or standard deviation) of the
distribution. (The mean and standard deviation will be formally defined in the next
chapter.) The Normal plays an important role in probability and statistics.
Many phenomena in nature have approximately Normal distributions. Later,
we shall study the Central Limit Theorem which says that the distribution of
a sum of random variables can be approximated by a Normal distribution.

We say that $X$ has a standard Normal distribution if $\mu = 0$ and $\sigma = 1$.
Tradition dictates that a standard Normal random variable is denoted by $Z$.
The PDF and CDF of a standard Normal are denoted by $\phi(z)$ and $\Phi(z)$. The
PDF is plotted in Figure 2.4. There is no closed-form expression for $\Phi$. Here
are some useful facts:

(i) If $X \sim N(\mu, \sigma^2)$, then $Z = (X - \mu)/\sigma \sim N(0, 1)$.

(ii) If $Z \sim N(0, 1)$, then $X = \mu + \sigma Z \sim N(\mu, \sigma^2)$.

(iii) If $X_i \sim N(\mu_i, \sigma_i^2)$, $i = 1, \ldots, n$ are independent, then

$$\sum_{i=1}^{n} X_i \sim N\left( \sum_{i=1}^{n} \mu_i,\; \sum_{i=1}^{n} \sigma_i^2 \right).$$

It follows from (i) that if $X \sim N(\mu, \sigma^2)$, then

$$\mathbb{P}(a < X < b) = \mathbb{P}\left( \frac{a - \mu}{\sigma} < Z < \frac{b - \mu}{\sigma} \right) = \Phi\left( \frac{b - \mu}{\sigma} \right) - \Phi\left( \frac{a - \mu}{\sigma} \right).$$

Thus we can compute any probabilities we want as long as we can compute
the CDF $\Phi(z)$ of a standard Normal. All statistical computing packages will
compute $\Phi(z)$ and $\Phi^{-1}(q)$. Most statistics texts, including this one, have a
table of values of $\Phi(z)$.

FIGURE 2.4. Density of a standard Normal.



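2.17 Example. Suppose that $X \sim N(3, 5)$. Find $\mathbb{P}(X > 1)$. The solution is

$$\mathbb{P}(X > 1) = 1 - \mathbb{P}(X < 1) = 1 - \mathbb{P}\left( Z < \frac{1 - 3}{\sqrt{5}} \right) = 1 - \Phi(-0.8944) = 0.81.$$

Now find $q = F^{-1}(0.2)$, where $F$ is the CDF of $X$. This means we have to find
$q$ such that $\mathbb{P}(X < q) = 0.2$. We solve this by writing

$$0.2 = \mathbb{P}(X < q) = \mathbb{P}\left( Z < \frac{q - \mu}{\sigma} \right) = \Phi\left( \frac{q - \mu}{\sigma} \right).$$

From the Normal table, $\Phi(-0.8416) = 0.2$. Therefore,

$$-0.8416 = \frac{q - \mu}{\sigma} = \frac{q - 3}{\sqrt{5}}$$

and hence $q = 3 - 0.8416\sqrt{5} = 1.1181$. ■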
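The two calculations in Example 2.17 can be reproduced without a Normal
table. The sketch below uses `statistics.NormalDist` from the Python standard
library (available from Python 3.8), which provides `cdf` and `inv_cdf`; note
that it is parameterized by the standard deviation, so N(3, 5) corresponds to
`sigma = sqrt(5)`. The variable names are chosen for this illustration.

```python
from math import sqrt
from statistics import NormalDist

X = NormalDist(mu=3, sigma=sqrt(5))   # N(3, 5): the second parameter is the variance
Z = NormalDist()                      # standard Normal, mu = 0 and sigma = 1

prob = 1 - X.cdf(1)                   # P(X > 1)
z_quantile = Z.inv_cdf(0.2)           # the value z with Phi(z) = 0.2
q = X.inv_cdf(0.2)                    # the value q with P(X < q) = 0.2

if __name__ == "__main__":
    print(round(prob, 4), round(z_quantile, 4), round(q, 4))
    # Standardizing by hand gives the same quantile as X.inv_cdf:
    print(round(3 + sqrt(5) * z_quantile, 4))
```

The last line applies fact (ii), X = μ + σZ, to convert the standard Normal
quantile into a quantile of X.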



Exponential Distribution. $X$ has an Exponential distribution with
parameter $\beta$, denoted by $X \sim \text{Exp}(\beta)$, if

$$f(x) = \frac{1}{\beta} e^{-x/\beta}, \quad x > 0$$

where $\beta > 0$. The exponential distribution is used to model the lifetimes of
electronic components and the waiting times between rare events.



Gamma Distribution. For $\alpha > 0$, the Gamma function is defined by
$\Gamma(\alpha) = \int_0^{\infty} y^{\alpha - 1} e^{-y}\,dy$. $X$ has a Gamma distribution with parameters $\alpha$ and
$\beta$, denoted by $X \sim \text{Gamma}(\alpha, \beta)$, if

$$f(x) = \frac{1}{\beta^{\alpha} \Gamma(\alpha)} x^{\alpha - 1} e^{-x/\beta}, \quad x > 0$$

where $\alpha, \beta > 0$. The exponential distribution is just a Gamma(1, $\beta$)
distribution. If $X_i \sim \text{Gamma}(\alpha_i, \beta)$ are independent, then
$\sum_{i=1}^{n} X_i \sim \text{Gamma}(\sum_{i=1}^{n} \alpha_i, \beta)$.



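The Beta Distribution. $X$ has a Beta distribution with parameters
$\alpha > 0$ and $\beta > 0$, denoted by $X \sim \text{Beta}(\alpha, \beta)$, if

$$f(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}, \quad 0 < x < 1.$$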



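t and Cauchy Distribution. $X$ has a $t$ distribution with $\nu$ degrees of
freedom, written $X \sim t_{\nu}$, if

$$f(x) = \frac{\Gamma\left( \frac{\nu + 1}{2} \right)}{\sqrt{\nu\pi}\,\Gamma\left( \frac{\nu}{2} \right)} \frac{1}{\left( 1 + \frac{x^2}{\nu} \right)^{(\nu + 1)/2}}.$$

The $t$ distribution is similar to a Normal but it has thicker tails. In fact, the
Normal corresponds to a $t$ with $\nu = \infty$. The Cauchy distribution is a special
case of the $t$ distribution corresponding to $\nu = 1$. The density is

$$f(x) = \frac{1}{\pi(1 + x^2)}.$$

To see that this is indeed a density:

$$\int_{-\infty}^{\infty} f(x)\,dx = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{dx}{1 + x^2} = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{d \tan^{-1}(x)}{dx}\,dx = \frac{1}{\pi} \left[ \tan^{-1}(\infty) - \tan^{-1}(-\infty) \right] = \frac{1}{\pi} \left[ \frac{\pi}{2} - \left( -\frac{\pi}{2} \right) \right] = 1.$$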



The $\chi^2$ distribution. $X$ has a $\chi^2$ distribution with $p$ degrees of freedom,
written $X \sim \chi^2_p$, if

$$f(x) = \frac{1}{\Gamma(p/2)\,2^{p/2}} x^{(p/2) - 1} e^{-x/2}, \quad x > 0.$$

If $Z_1, \ldots, Z_p$ are independent standard Normal random variables then
$\sum_{i=1}^{p} Z_i^2 \sim \chi^2_p$.

2.5 Bivariate Distributions

Given a pair of discrete random variables $X$ and $Y$, define the joint mass
function by $f(x, y) = \mathbb{P}(X = x \text{ and } Y = y)$. From now on, we write
$\mathbb{P}(X = x \text{ and } Y = y)$ as $\mathbb{P}(X = x, Y = y)$. We write $f$ as $f_{X,Y}$ when we want to be
more explicit.

2.18 Example. Here is a bivariate distribution for two random variables $X$
and $Y$ each taking values 0 or 1:

              Y = 0    Y = 1
    X = 0     1/9      2/9      1/3
    X = 1     2/9      4/9      2/3
              1/3      2/3      1

Thus, $f(1, 1) = \mathbb{P}(X = 1, Y = 1) = 4/9$. ■

2.19 Definition. In the continuous case, we call a function $f(x, y)$ a PDF
for the random variables $(X, Y)$ if

(i) $f(x, y) \ge 0$ for all $(x, y)$,

(ii) $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$ and,

(iii) for any set $A \subset \mathbb{R} \times \mathbb{R}$, $\mathbb{P}((X, Y) \in A) = \iint_A f(x, y)\,dx\,dy$.

In the discrete or continuous case we define the joint CDF as
$F_{X,Y}(x, y) = \mathbb{P}(X \le x, Y \le y)$.

2.20 Example. Let $(X, Y)$ be uniform on the unit square. Then,

$$f(x, y) = \begin{cases} 1 & \text{if } 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Find $\mathbb{P}(X < 1/2, Y < 1/2)$. The event $A = \{X < 1/2, Y < 1/2\}$ corresponds
to a subset of the unit square. Integrating $f$ over this subset corresponds, in
this case, to computing the area of the set $A$ which is 1/4. So,
$\mathbb{P}(X < 1/2, Y < 1/2) = 1/4$. ■

2.21 Example. Let $(X, Y)$ have density

$$f(x, y) = \begin{cases} x + y & \text{if } 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Then

$$\int_0^1 \int_0^1 (x + y)\,dx\,dy = \int_0^1 \left[ \int_0^1 x\,dx \right] dy + \int_0^1 \left[ \int_0^1 y\,dx \right] dy = \int_0^1 \frac{1}{2}\,dy + \int_0^1 y\,dy = \frac{1}{2} + \frac{1}{2} = 1$$

which verifies that this is a PDF. ■



2.22 Example. If the distribution is defined over a non-rectangular region,
then the calculations are a bit more complicated. Here is an example which I
borrowed from DeGroot and Schervish (2002). Let $(X, Y)$ have density

$$f(x, y) = \begin{cases} c\,x^2 y & \text{if } x^2 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Note first that $-1 \le x \le 1$. Now let us find the value of $c$. The trick here is
to be careful about the range of integration. We pick one variable, $x$ say, and
let it range over its values. Then, for each fixed value of $x$, we let $y$ vary over
its range, which is $x^2 \le y \le 1$. It may help if you look at Figure 2.5. Thus,

$$1 = \iint f(x, y)\,dy\,dx = c \int_{-1}^{1} \int_{x^2}^{1} x^2 y\,dy\,dx = c \int_{-1}^{1} x^2 \left[ \int_{x^2}^{1} y\,dy \right] dx = c \int_{-1}^{1} x^2\,\frac{1 - x^4}{2}\,dx = \frac{4c}{21}.$$

Hence, $c = 21/4$. Now let us compute $\mathbb{P}(X \ge Y)$. This corresponds to the set
$A = \{(x, y) : 0 \le x \le 1,\ x^2 \le y \le x\}$. (You can see this by drawing a diagram.)
So,

$$\mathbb{P}(X \ge Y) = \frac{21}{4} \int_0^1 \int_{x^2}^{x} x^2 y\,dy\,dx = \frac{21}{4} \int_0^1 x^2 \left[ \int_{x^2}^{x} y\,dy \right] dx = \frac{21}{4} \int_0^1 x^2\,\frac{x^2 - x^4}{2}\,dx = \frac{3}{20}. \ ■$$

FIGURE 2.5. The light shaded region is $x^2 \le y \le 1$. The density is positive over
this region. The hatched region is the event $X \ge Y$ intersected with $x^2 \le y \le 1$.

2.6 Marginal Distributions 



2.23 Definition. If $(X, Y)$ have joint distribution with mass function
$f_{X,Y}$, then the marginal mass function for $X$ is defined by

$$f_X(x) = \mathbb{P}(X = x) = \sum_y \mathbb{P}(X = x, Y = y) = \sum_y f(x, y) \qquad (2.4)$$

and the marginal mass function for $Y$ is defined by

$$f_Y(y) = \mathbb{P}(Y = y) = \sum_x \mathbb{P}(X = x, Y = y) = \sum_x f(x, y). \qquad (2.5)$$



2.24 Example. Suppose that $f_{X,Y}$ is given in the table that follows. The
marginal distribution for $X$ corresponds to the row totals and the marginal
distribution for $Y$ corresponds to the column totals.

              Y = 0    Y = 1
    X = 0     1/10     2/10     3/10
    X = 1     3/10     4/10     7/10
              4/10     6/10     1

For example, $f_X(0) = 3/10$ and $f_X(1) = 7/10$. ■

2.25 Definition. For continuous random variables, the marginal densities
are

$$f_X(x) = \int f(x, y)\,dy, \quad\text{and}\quad f_Y(y) = \int f(x, y)\,dx. \qquad (2.6)$$

The corresponding marginal distribution functions are denoted by $F_X$ and
$F_Y$.
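For a discrete joint distribution such as the table in Example 2.24, the
marginal mass functions are just row and column sums. The sketch below shows
one way to compute them, storing the joint mass function as a dictionary keyed
by (x, y); that representation and the name `marginals` are assumptions made
for this illustration.

```python
from fractions import Fraction

# Joint mass function of Example 2.24, keyed by (x, y).
joint = {(0, 0): Fraction(1, 10), (0, 1): Fraction(2, 10),
         (1, 0): Fraction(3, 10), (1, 1): Fraction(4, 10)}

def marginals(joint):
    """Return (f_X, f_Y) by summing the joint pmf over the other variable."""
    fx, fy = {}, {}
    for (x, y), p in joint.items():
        fx[x] = fx.get(x, 0) + p   # sum over y for fixed x (row total)
        fy[y] = fy.get(y, 0) + p   # sum over x for fixed y (column total)
    return fx, fy

fx, fy = marginals(joint)

if __name__ == "__main__":
    print(fx)
    print(fy)
```

In the continuous case the sums become the integrals of Definition 2.25.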



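2.26 Example. Suppose that

$$f_{X,Y}(x, y) = e^{-(x + y)}$$

for $x, y \ge 0$. Then $f_X(x) = e^{-x} \int_0^{\infty} e^{-y}\,dy = e^{-x}$. ■

2.27 Example. Suppose that

$$f(x, y) = \begin{cases} x + y & \text{if } 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Then

$$f_Y(y) = \int_0^1 (x + y)\,dx = \int_0^1 x\,dx + \int_0^1 y\,dx = \frac{1}{2} + y. \ ■$$

2.28 Example. Let $(X, Y)$ have density

$$f(x, y) = \begin{cases} \frac{21}{4} x^2 y & \text{if } x^2 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Thus,

$$f_X(x) = \int f(x, y)\,dy = \frac{21}{4} x^2 \int_{x^2}^{1} y\,dy = \frac{21}{8} x^2 (1 - x^4)$$

for $-1 \le x \le 1$ and $f_X(x) = 0$ otherwise. ■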



2.7 Independent Random Variables 



2.29 Definition. Two random variables $X$ and $Y$ are independent if,
for every $A$ and $B$,

$$\mathbb{P}(X \in A, Y \in B) = \mathbb{P}(X \in A)\,\mathbb{P}(Y \in B) \qquad (2.7)$$

and we write $X \amalg Y$. Otherwise we say that $X$ and $Y$ are dependent
and we write $X \not\amalg Y$.

In principle, to check whether $X$ and $Y$ are independent we need to check
equation (2.7) for all subsets $A$ and $B$. Fortunately, we have the following
result which we state for continuous random variables though it is true for
discrete random variables too.

2.30 Theorem. Let $X$ and $Y$ have joint PDF $f_{X,Y}$. Then $X \amalg Y$ if and only
if $f_{X,Y}(x, y) = f_X(x) f_Y(y)$ for all values $x$ and $y$.⁵



2.31 Example. Let $X$ and $Y$ have the following distribution:

              Y = 0    Y = 1
    X = 0     1/4      1/4      1/2
    X = 1     1/4      1/4      1/2
              1/2      1/2      1

Then, $f_X(0) = f_X(1) = 1/2$ and $f_Y(0) = f_Y(1) = 1/2$. $X$ and $Y$ are
independent because $f_X(0) f_Y(0) = f(0, 0)$, $f_X(0) f_Y(1) = f(0, 1)$, $f_X(1) f_Y(0) = f(1, 0)$,
$f_X(1) f_Y(1) = f(1, 1)$. Suppose instead that $X$ and $Y$ have the following
distribution:

              Y = 0    Y = 1
    X = 0     1/2      0        1/2
    X = 1     0        1/2      1/2
              1/2      1/2      1

These are not independent because $f_X(0) f_Y(1) = (1/2)(1/2) = 1/4$ yet
$f(0, 1) = 0$. ■
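The factorization check used in Example 2.31 is mechanical, so it can be
written as a small function. This is a hedged sketch: it assumes the joint
mass function is supplied as a dictionary with an entry for every (x, y) pair,
and the name `is_independent` is hypothetical.

```python
from fractions import Fraction

def is_independent(joint):
    """True if f(x, y) = f_X(x) f_Y(y) for every cell of a full joint table."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    fx = {x: sum(joint[(x, y)] for y in ys) for x in xs}   # row totals
    fy = {y: sum(joint[(x, y)] for x in xs) for y in ys}   # column totals
    return all(joint[(x, y)] == fx[x] * fy[y] for x in xs for y in ys)

quarter, half, zero = Fraction(1, 4), Fraction(1, 2), Fraction(0)
first_table = {(0, 0): quarter, (0, 1): quarter, (1, 0): quarter, (1, 1): quarter}
second_table = {(0, 0): half, (0, 1): zero, (1, 0): zero, (1, 1): half}

if __name__ == "__main__":
    print(is_independent(first_table), is_independent(second_table))
```

Exact fractions avoid false negatives from floating-point rounding when
comparing products of probabilities.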



2.32 Example. Suppose that $X$ and $Y$ are independent and both have the
same density

$$f(x) = \begin{cases} 2x & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Let us find $\mathbb{P}(X + Y \le 1)$. Using independence, the joint density is

$$f(x, y) = f_X(x) f_Y(y) = \begin{cases} 4xy & \text{if } 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Now,

$$\mathbb{P}(X + Y \le 1) = \iint_{x + y \le 1} f(x, y)\,dy\,dx = 4 \int_0^1 x \left[ \int_0^{1 - x} y\,dy \right] dx = 4 \int_0^1 x\,\frac{(1 - x)^2}{2}\,dx = \frac{1}{6}. \ ■$$

⁵The statement is not rigorous because the density is defined only up to sets of
measure 0.
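The value 1/6 in Example 2.32 can be sanity-checked by simulation. The sketch
below is an illustration only: it draws from the density 2x on [0, 1] by
inverting its CDF x², that is, by taking the square root of a Uniform(0, 1)
draw (a sampling technique chosen here, not one described in the text). The
function name, seed, and sample size are arbitrary.

```python
import random
from math import sqrt

def estimate_prob_sum_at_most_one(n, seed):
    """Monte Carlo estimate of P(X + Y <= 1) for independent X, Y with density 2x."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x = sqrt(rng.random())   # CDF of X is x^2 on [0, 1], so X = sqrt(U)
        y = sqrt(rng.random())   # independent draw for Y
        if x + y <= 1:
            hits += 1
    return hits / n

if __name__ == "__main__":
    print(estimate_prob_sum_at_most_one(200_000, seed=1), 1 / 6)
```

The two printed numbers should agree to about two decimal places; the
simulation error shrinks like one over the square root of the sample size.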



The following result is helpful for verifying independence. 

2.33 Theorem. Suppose that the range of $X$ and $Y$ is a (possibly infinite)
rectangle. If $f(x, y) = g(x) h(y)$ for some functions $g$ and $h$ (not necessarily
probability density functions) then $X$ and $Y$ are independent.

2.34 Example. Let $X$ and $Y$ have density

$$f(x, y) = \begin{cases} 2 e^{-(x + 2y)} & \text{if } x > 0 \text{ and } y > 0 \\ 0 & \text{otherwise.} \end{cases}$$

The range of $X$ and $Y$ is the rectangle $(0, \infty) \times (0, \infty)$. We can write
$f(x, y) = g(x) h(y)$ where $g(x) = 2e^{-x}$ and $h(y) = e^{-2y}$. Thus, $X \amalg Y$. ■

2.8 Conditional Distributions 



If $X$ and $Y$ are discrete, then we can compute the conditional distribution of
$X$ given that we have observed $Y = y$. Specifically,
$\mathbb{P}(X = x \mid Y = y) = \mathbb{P}(X = x, Y = y)/\mathbb{P}(Y = y)$. This leads us to define the
conditional probability mass function as follows.



2.35 Definition. The conditional probability mass function is

$$f_{X|Y}(x \mid y) = \mathbb{P}(X = x \mid Y = y) = \frac{\mathbb{P}(X = x, Y = y)}{\mathbb{P}(Y = y)} = \frac{f_{X,Y}(x, y)}{f_Y(y)}$$

if $f_Y(y) > 0$.



For continuous distributions we use the same definitions.⁶ The interpretation
differs: in the discrete case, $f_{X|Y}(x \mid y)$ is $\mathbb{P}(X = x \mid Y = y)$, but in the
continuous case, we must integrate to get a probability.

⁶We are treading in deep water here. When we compute $\mathbb{P}(X \in A \mid Y = y)$ in the
continuous case we are conditioning on the event $\{Y = y\}$ which has probability 0. We
avoid this problem by defining things in terms of the PDF. The fact that this leads to
a well-defined theory is proved in more advanced courses. Here, we simply take it as a
definition.

2.36 Definition. For continuous random variables, the conditional
probability density function is

$$f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}$$

assuming that $f_Y(y) > 0$. Then,

$$\mathbb{P}(X \in A \mid Y = y) = \int_A f_{X|Y}(x \mid y)\,dx.$$

2.37 Example. Let $X$ and $Y$ have a joint uniform distribution on the unit
square. Thus, $f_{X|Y}(x \mid y) = 1$ for $0 \le x \le 1$ and 0 otherwise. Given $Y = y$, $X$
is Uniform(0, 1). We can write this as $X \mid Y = y \sim \text{Uniform}(0, 1)$. ■

From the definition of the conditional density, we see that
$f_{X,Y}(x, y) = f_{X|Y}(x \mid y) f_Y(y) = f_{Y|X}(y \mid x) f_X(x)$. This can sometimes be useful as in
Example 2.39.

2.38 Example. Let

$$f(x, y) = \begin{cases} x + y & \text{if } 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Let us find $\mathbb{P}(X < 1/4 \mid Y = 1/3)$. In Example 2.27 we saw that
$f_Y(y) = y + (1/2)$. Hence,

$$f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} = \frac{x + y}{y + \frac{1}{2}}.$$

So,

$$\mathbb{P}\left( X < \tfrac{1}{4} \,\Big|\, Y = \tfrac{1}{3} \right) = \int_0^{1/4} f_{X|Y}\left( x \,\Big|\, \tfrac{1}{3} \right) dx = \int_0^{1/4} \frac{x + \frac{1}{3}}{\frac{1}{3} + \frac{1}{2}}\,dx = \frac{\frac{1}{32} + \frac{1}{12}}{\frac{1}{3} + \frac{1}{2}} = \frac{11}{80}. \ ■$$

2.39 Example. Suppose that $X \sim \text{Uniform}(0, 1)$. After obtaining a value of
$X$ we generate $Y \mid X = x \sim \text{Uniform}(x, 1)$. What is the marginal distribution
of $Y$? First note that,

$$f_X(x) = \begin{cases} 1 & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$

and

$$f_{Y|X}(y \mid x) = \begin{cases} \dfrac{1}{1 - x} & \text{if } 0 < x < y < 1 \\ 0 & \text{otherwise.} \end{cases}$$

So,

$$f_{X,Y}(x, y) = f_{Y|X}(y \mid x) f_X(x) = \begin{cases} \dfrac{1}{1 - x} & \text{if } 0 < x < y < 1 \\ 0 & \text{otherwise.} \end{cases}$$

The marginal for $Y$ is

$$f_Y(y) = \int_0^y f_{X,Y}(x, y)\,dx = \int_0^y \frac{dx}{1 - x} = -\int_1^{1 - y} \frac{du}{u} = -\log(1 - y)$$

for $0 < y < 1$. ■
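The two-stage construction in Example 2.39 is easy to simulate, which gives a
check on the marginal density −log(1 − y). The sketch below compares the
empirical CDF of simulated Y values with the CDF obtained by integrating that
density, F_Y(y) = y + (1 − y) log(1 − y); this closed form is derived here for
the illustration and is not stated in the text. The function names and the
seed are arbitrary.

```python
import random
from math import log

def simulate_y(rng):
    """Draw X ~ Uniform(0, 1), then Y | X = x ~ Uniform(x, 1); return Y."""
    x = rng.random()
    return rng.uniform(x, 1.0)

def cdf_y(y):
    """Integral of -log(1 - t) from 0 to y, valid for 0 <= y < 1."""
    return y + (1 - y) * log(1 - y)

rng = random.Random(2)
ys = [simulate_y(rng) for _ in range(200_000)]

if __name__ == "__main__":
    for y in (0.25, 0.5, 0.75):
        empirical = sum(v <= y for v in ys) / len(ys)
        print(y, round(empirical, 3), round(cdf_y(y), 3))
```

Most of the mass of Y sits near 1, which is consistent with a density that
grows without bound as y approaches 1.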



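2.40 Example. Consider the density in Example 2.28. Let’s find $f_{Y|X}(y \mid x)$.
When $X = x$, $y$ must satisfy $x^2 \le y \le 1$. Earlier, we saw that
$f_X(x) = (21/8) x^2 (1 - x^4)$. Hence, for $x^2 \le y \le 1$,

$$f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)} = \frac{\frac{21}{4} x^2 y}{\frac{21}{8} x^2 (1 - x^4)} = \frac{2y}{1 - x^4}.$$

Now let us compute $\mathbb{P}(Y \ge 3/4 \mid X = 1/2)$. This can be done by first noting
that $f_{Y|X}(y \mid 1/2) = 32y/15$. Thus,

$$\mathbb{P}(Y \ge 3/4 \mid X = 1/2) = \int_{3/4}^{1} f(y \mid 1/2)\,dy = \int_{3/4}^{1} \frac{32y}{15}\,dy = \frac{7}{15}. \ ■$$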



2.9 Multivariate Distributions and IID Samples 

Let $X = (X_1, \ldots, X_n)$ where $X_1, \ldots, X_n$ are random variables. We call $X$ a
random vector. Let $f(x_1, \ldots, x_n)$ denote the PDF. It is possible to define
their marginals, conditionals etc. much the same way as in the bivariate case.
We say that $X_1, \ldots, X_n$ are independent if, for every $A_1, \ldots, A_n$,

$$\mathbb{P}(X_1 \in A_1, \ldots, X_n \in A_n) = \prod_{i=1}^{n} \mathbb{P}(X_i \in A_i). \qquad (2.8)$$

It suffices to check that $f(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i)$.

2.41 Definition. If $X_1, \ldots, X_n$ are independent and each has the same
marginal distribution with CDF $F$, we say that $X_1, \ldots, X_n$ are IID
(independent and identically distributed) and we write

$$X_1, \ldots, X_n \sim F.$$

If $F$ has density $f$ we also write $X_1, \ldots, X_n \sim f$. We also call $X_1, \ldots, X_n$
a random sample of size $n$ from $F$.



Much of statistical theory and practice begins with IID observations and we 
shall study this case in detail when we discuss statistics. 



2.10 Two Important Multivariate Distributions 



Multinomial. The multivariate version of a Binomial is called a
Multinomial. Consider drawing a ball from an urn which has balls with $k$ different
colors labeled “color 1, color 2, ..., color k.” Let $p = (p_1, \ldots, p_k)$ where
$p_j \ge 0$ and $\sum_{j=1}^{k} p_j = 1$ and suppose that $p_j$ is the probability of drawing
a ball of color $j$. Draw $n$ times (independent draws with replacement) and
let $X = (X_1, \ldots, X_k)$ where $X_j$ is the number of times that color $j$ appears.
Hence, $n = \sum_{j=1}^{k} X_j$. We say that $X$ has a Multinomial($n$, $p$) distribution
written $X \sim \text{Multinomial}(n, p)$. The probability function is

$$f(x) = \binom{n}{x_1 \ldots x_k} p_1^{x_1} \cdots p_k^{x_k} \qquad (2.9)$$

where

$$\binom{n}{x_1 \ldots x_k} = \frac{n!}{x_1! \cdots x_k!}.$$

2.42 Lemma. Suppose that $X \sim \text{Multinomial}(n, p)$ where $X = (X_1, \ldots, X_k)$
and $p = (p_1, \ldots, p_k)$. The marginal distribution of $X_j$ is Binomial($n$, $p_j$).



Multivariate Normal. The univariate Normal has two parameters, $\mu$
and $\sigma$. In the multivariate version, $\mu$ is a vector and $\sigma$ is replaced by a matrix
$\Sigma$. To begin, let

$$Z = \begin{pmatrix} Z_1 \\ \vdots \\ Z_k \end{pmatrix}$$

where $Z_1, \ldots, Z_k \sim N(0, 1)$ are independent. The density of $Z$ is⁷

$$f(z) = \prod_{i=1}^{k} f(z_i) = \frac{1}{(2\pi)^{k/2}} \exp\left\{ -\frac{1}{2} \sum_{j=1}^{k} z_j^2 \right\} = \frac{1}{(2\pi)^{k/2}} \exp\left\{ -\frac{1}{2} z^T z \right\}.$$

We say that $Z$ has a standard multivariate Normal distribution written
$Z \sim N(0, I)$ where it is understood that 0 represents a vector of $k$ zeroes and $I$ is
the $k \times k$ identity matrix.

More generally, a vector $X$ has a multivariate Normal distribution, denoted
by $X \sim N(\mu, \Sigma)$, if it has density⁸

$$f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\} \qquad (2.10)$$

where $|\Sigma|$ denotes the determinant of $\Sigma$, $\mu$ is a vector of length $k$ and $\Sigma$ is a
$k \times k$ symmetric, positive definite matrix.⁹ Setting $\mu = 0$ and $\Sigma = I$ gives
back the standard Normal.

Since $\Sigma$ is symmetric and positive definite, it can be shown that there exists
a matrix $\Sigma^{1/2}$, called the square root of $\Sigma$, with the following properties:
(i) $\Sigma^{1/2}$ is symmetric, (ii) $\Sigma = \Sigma^{1/2} \Sigma^{1/2}$ and (iii) $\Sigma^{1/2} \Sigma^{-1/2} = \Sigma^{-1/2} \Sigma^{1/2} = I$
where $\Sigma^{-1/2} = (\Sigma^{1/2})^{-1}$.

2.43 Theorem. If $Z \sim N(0, I)$ and $X = \mu + \Sigma^{1/2} Z$ then $X \sim N(\mu, \Sigma)$.
Conversely, if $X \sim N(\mu, \Sigma)$, then $\Sigma^{-1/2}(X - \mu) \sim N(0, I)$.
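Theorem 2.43 gives a recipe for simulating a multivariate Normal: draw Z and
set X = μ + Σ^{1/2} Z. The sketch below illustrates this for k = 2 using only
the standard library. For a 2 × 2 symmetric positive definite matrix it uses
the closed form Σ^{1/2} = (Σ + sI)/t with s = √det Σ and t = √(tr Σ + 2s),
which follows from the Cayley-Hamilton identity; this formula, the function
names, and the example values of μ and Σ are assumptions of the illustration,
not part of the text. In practice a numerical linear algebra library would be
used for general k.

```python
import random
from math import sqrt

def sqrt_2x2(S):
    """Symmetric square root of a 2x2 symmetric positive definite matrix."""
    (a, b), (_, d) = S
    s = sqrt(a * d - b * b)          # square root of the determinant
    t = sqrt(a + d + 2 * s)          # sqrt(trace + 2 * sqrt(det))
    return [[(a + s) / t, b / t], [b / t, (d + s) / t]]

def matmul_2x2(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def sample(mu, S, rng):
    """One draw of X = mu + S^{1/2} Z with Z_1, Z_2 independent N(0, 1)."""
    R = sqrt_2x2(S)
    z = [rng.gauss(0, 1), rng.gauss(0, 1)]
    return [mu[i] + R[i][0] * z[0] + R[i][1] * z[1] for i in range(2)]

Sigma = [[2.0, 1.0], [1.0, 2.0]]
R = sqrt_2x2(Sigma)
RR = matmul_2x2(R, R)                # should reproduce Sigma up to rounding

rng = random.Random(3)
xs = [sample([1.0, -1.0], Sigma, rng) for _ in range(100_000)]
m0 = sum(x[0] for x in xs) / len(xs)
m1 = sum(x[1] for x in xs) / len(xs)
cov01 = sum((x[0] - m0) * (x[1] - m1) for x in xs) / len(xs)

if __name__ == "__main__":
    print(RR)
    print(round(m0, 2), round(m1, 2), round(cov01, 2))
```

The sample means and sample covariance should be close to the chosen μ and to
the off-diagonal entry of Σ, respectively.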

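Suppose we partition a random Normal vector $X$ as $X = (X_a, X_b)$. We can
similarly partition $\mu = (\mu_a, \mu_b)$ and

$$\Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}.$$

2.44 Theorem. Let $X \sim N(\mu, \Sigma)$. Then:

(1) The marginal distribution of $X_a$ is $X_a \sim N(\mu_a, \Sigma_{aa})$.

(2) The conditional distribution of $X_b$ given $X_a = x_a$ is

$$X_b \mid X_a = x_a \sim N\left( \mu_b + \Sigma_{ba} \Sigma_{aa}^{-1} (x_a - \mu_a),\; \Sigma_{bb} - \Sigma_{ba} \Sigma_{aa}^{-1} \Sigma_{ab} \right).$$

(3) If $a$ is a vector then $a^T X \sim N(a^T \mu, a^T \Sigma a)$.

(4) $V = (X - \mu)^T \Sigma^{-1} (X - \mu) \sim \chi^2_k$.

⁷If $a$ and $b$ are vectors then $a^T b = \sum_{i=1}^{k} a_i b_i$.

⁸$\Sigma^{-1}$ is the inverse of the matrix $\Sigma$.

⁹A matrix $\Sigma$ is positive definite if, for all nonzero vectors $x$, $x^T \Sigma x > 0$.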




2.11 Transformations of Random Variables

Suppose that $X$ is a random variable with PDF $f_X$ and CDF $F_X$. Let $Y = r(X)$
be a function of $X$, for example, $Y = X^2$ or $Y = e^X$. We call $Y = r(X)$ a
transformation of $X$. How do we compute the PDF and CDF of $Y$? In the
discrete case, the answer is easy. The mass function of $Y$ is given by

$$f_Y(y) = \mathbb{P}(Y = y) = \mathbb{P}(r(X) = y) = \mathbb{P}(\{x : r(x) = y\}) = \mathbb{P}(X \in r^{-1}(y)).$$



2.45 Example. Suppose that $\mathbb{P}(X = -1) = \mathbb{P}(X = 1) = 1/4$ and $\mathbb{P}(X = 0) = 1/2$.
Let $Y = X^2$. Then, $\mathbb{P}(Y = 0) = \mathbb{P}(X = 0) = 1/2$ and
$\mathbb{P}(Y = 1) = \mathbb{P}(X = 1) + \mathbb{P}(X = -1) = 1/2$. Summarizing:

    x      f_X(x)
    -1     1/4
    0      1/2
    1      1/4

    y      f_Y(y)
    0      1/2
    1      1/2

$Y$ takes fewer values than $X$ because the transformation is not one-to-one. ■
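The discrete rule f_Y(y) = P(X ∈ r⁻¹(y)) amounts to adding up the mass of
every x that r sends to the same y. A minimal sketch of that bookkeeping,
applied to Example 2.45, follows; the name `transform_pmf` and the dictionary
representation are choices made for this illustration.

```python
from fractions import Fraction

def transform_pmf(pmf_x, r):
    """Mass function of Y = r(X): collect the mass of all x with r(x) = y."""
    pmf_y = {}
    for x, p in pmf_x.items():
        y = r(x)
        pmf_y[y] = pmf_y.get(y, 0) + p
    return pmf_y

pmf_x = {-1: Fraction(1, 4), 0: Fraction(1, 2), 1: Fraction(1, 4)}
pmf_y = transform_pmf(pmf_x, lambda x: x ** 2)

if __name__ == "__main__":
    print(pmf_y)
```

Because x = −1 and x = 1 both map to y = 1, their masses are merged, which is
the sense in which Y takes fewer values than X.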



The continuous case is harder. There are three steps for finding $f_Y$:

Three Steps for Transformations

1. For each $y$, find the set $A_y = \{x : r(x) \le y\}$.

2. Find the CDF

$$F_Y(y) = \mathbb{P}(Y \le y) = \mathbb{P}(r(X) \le y) = \mathbb{P}(\{x : r(x) \le y\}) = \int_{A_y} f_X(x)\,dx. \qquad (2.11)$$

3. The PDF is $f_Y(y) = F_Y'(y)$.







2.46 Example. Let $f_X(x) = e^{-x}$ for $x > 0$. Hence, $F_X(x) = \int_0^x f_X(s)\,ds = 1 - e^{-x}$.
Let $Y = r(X) = \log X$. Then, $A_y = \{x : x \le e^y\}$ and

$$F_Y(y) = \mathbb{P}(Y \le y) = \mathbb{P}(\log X \le y) = \mathbb{P}(X \le e^y) = F_X(e^y) = 1 - e^{-e^y}.$$

Therefore, $f_Y(y) = e^y e^{-e^y}$ for $y \in \mathbb{R}$. ■

2.47 Example. Let $X \sim \text{Uniform}(-1, 3)$. Find the PDF of $Y = X^2$. The
density of $X$ is

$$f_X(x) = \begin{cases} 1/4 & \text{if } -1 < x < 3 \\ 0 & \text{otherwise.} \end{cases}$$

$Y$ can only take values in $(0, 9)$. Consider two cases: (i) $0 < y < 1$ and
(ii) $1 \le y < 9$. For case (i), $A_y = [-\sqrt{y}, \sqrt{y}]$ and $F_Y(y) = \int_{A_y} f_X(x)\,dx = (1/2)\sqrt{y}$.
For case (ii), $A_y = [-1, \sqrt{y}]$ and $F_Y(y) = \int_{A_y} f_X(x)\,dx = (1/4)(\sqrt{y} + 1)$.
Differentiating $F$ we get

$$f_Y(y) = \begin{cases} \dfrac{1}{4\sqrt{y}} & \text{if } 0 < y < 1 \\ \dfrac{1}{8\sqrt{y}} & \text{if } 1 < y < 9 \\ 0 & \text{otherwise.} \end{cases} \ ■$$

When $r$ is strictly monotone increasing or strictly monotone decreasing then
$r$ has an inverse $s = r^{-1}$ and in this case one can show that

$$f_Y(y) = f_X(s(y)) \left| \frac{ds(y)}{dy} \right|. \qquad (2.12)$$
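The CDF obtained in Example 2.47 can be checked by simulation: draw X from
Uniform(−1, 3), square it, and compare the empirical proportion of values
below y with the two-case formula for F_Y. The sketch below is an illustration
only; the function name `cdf_y_squared`, the seed, and the evaluation points
are arbitrary.

```python
import random
from math import sqrt

def cdf_y_squared(y):
    """CDF of Y = X^2 for X ~ Uniform(-1, 3), following Example 2.47."""
    if y <= 0:
        return 0.0
    if y < 1:
        return 0.5 * sqrt(y)             # case (i): A_y = [-sqrt(y), sqrt(y)]
    if y < 9:
        return 0.25 * (sqrt(y) + 1)      # case (ii): A_y = [-1, sqrt(y)]
    return 1.0

rng = random.Random(4)
squares = [rng.uniform(-1, 3) ** 2 for _ in range(200_000)]

if __name__ == "__main__":
    for y in (0.25, 1.0, 4.0):
        empirical = sum(v <= y for v in squares) / len(squares)
        print(y, round(empirical, 3), cdf_y_squared(y))
```

The split into two cases appears because for y < 1 both the negative and the
positive square roots lie inside the range of X, while for y ≥ 1 the set A_y
is cut off at −1.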

2.12 Transformations of Several Random Variables 

In some cases we are interested in transformations of several random variables. 
For example, if X and Y are given random variables, we might want to know 
the distribution of XjY ^ X + V, max{X, V} or min{X, V}. Let Z = r(X, Y) 
be the function of interest. The steps for hnding fz are the same as before: 

Three Steps for Transformations 

1. For each z, hnd the set = {{x^y) : r{x^y) < z}. 

2. Find the CDF 

Fz(z) = F(Z <z)= P(r(X, Y) < z) 

= IP({(a^,2/); r{x,y)<z}) = J fx,Y{x,y) dx dy. 



3. Then fz{z) = F'^z). 
2.48 Example. Let $X_1, X_2 \sim \text{Uniform}(0, 1)$ be independent. Find the density
of $Y = X_1 + X_2$. The joint density of $(X_1, X_2)$ is

$$f(x_1, x_2) = \begin{cases} 1 & 0 < x_1 < 1,\ 0 < x_2 < 1 \\ 0 & \text{otherwise.} \end{cases}$$

Let $r(x_1, x_2) = x_1 + x_2$. Now,

$$F_Y(y) = \mathbb{P}(Y \le y) = \mathbb{P}(r(X_1, X_2) \le y) = \mathbb{P}(\{(x_1, x_2) : r(x_1, x_2) \le y\}) = \iint_{A_y} f(x_1, x_2)\,dx_1\,dx_2.$$

Now comes the hard part: finding $A_y$. First suppose that $0 < y < 1$. Then $A_y$
is the triangle with vertices $(0, 0)$, $(y, 0)$ and $(0, y)$. See Figure 2.6. In this case,
$\iint_{A_y} f(x_1, x_2)\,dx_1\,dx_2$ is the area of this triangle, which is $y^2/2$. If $1 \le y < 2$,
then $A_y$ is everything in the unit square except the triangle with vertices
$(1, y - 1)$, $(1, 1)$, $(y - 1, 1)$. This set has area $1 - (2 - y)^2/2$. Therefore,

$$F_Y(y) = \begin{cases} 0 & y < 0 \\ \dfrac{y^2}{2} & 0 \le y < 1 \\ 1 - \dfrac{(2 - y)^2}{2} & 1 \le y < 2 \\ 1 & y \ge 2. \end{cases}$$

By differentiation, the PDF is

$$f_Y(y) = \begin{cases} y & 0 \le y \le 1 \\ 2 - y & 1 \le y \le 2 \\ 0 & \text{otherwise.} \end{cases}$$ ■
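The three steps can be checked numerically. The sketch below is only an illustration, not part of the original example: it codes the CDF derived in Example 2.48 and compares it with the fraction of simulated sums falling below $y$. The function names, the seed, and the number of draws are assumptions made for this sketch.

```python
import random

def cdf_sum_uniform(y):
    """CDF of Y = X1 + X2 for independent Uniform(0,1) variables (Example 2.48)."""
    if y < 0:
        return 0.0
    if y < 1:
        return y * y / 2
    if y < 2:
        return 1 - (2 - y) ** 2 / 2
    return 1.0

def empirical_cdf_sum(y, n_draws=200_000, seed=0):
    """Fraction of simulated sums X1 + X2 that are <= y (hypothetical helper)."""
    rng = random.Random(seed)
    hits = sum(rng.random() + rng.random() <= y for _ in range(n_draws))
    return hits / n_draws

if __name__ == "__main__":
    for y in (0.25, 0.5, 1.0, 1.5, 1.75):
        print(y, cdf_sum_uniform(y), round(empirical_cdf_sum(y), 4))
```

The two columns should agree up to simulation noise, which shrinks roughly like one over the square root of the number of draws.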



2.13 Appendix 

Recall that a probability measure $\mathbb{P}$ is defined on a $\sigma$-field $\mathcal{A}$ of a sample
space $\Omega$. A random variable $X$ is a measurable map $X : \Omega \to \mathbb{R}$. Measurable
means that, for every $x$, $\{\omega : X(\omega) \le x\} \in \mathcal{A}$.



2.14 Exercises 

1. Show that

$$\mathbb{P}(X = x) = F(x^+) - F(x^-).$$
FIGURE 2.6. The set $A_y$ for Example 2.48. $A_y$ consists of all points $(x_1, x_2)$ in the
square below the line $x_2 = y - x_1$. Left panel: the case $0 < y < 1$. Right panel: the case $1 < y < 2$.



2. Let $X$ be such that $\mathbb{P}(X = 2) = \mathbb{P}(X = 3) = 1/10$ and $\mathbb{P}(X = 5) = 8/10$.
Plot the CDF $F$. Use $F$ to find $\mathbb{P}(2 < X \le 4.8)$ and $\mathbb{P}(2 \le X \le 4.8)$.

3. Prove Lemma 2.15. 



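4. Let $X$ have probability density function

$$f_X(x) = \begin{cases} 1/4 & 0 < x < 1 \\ 3/8 & 3 < x < 5 \\ 0 & \text{otherwise.} \end{cases}$$

(a) Find the cumulative distribution function of $X$.

(b) Let $Y = 1/X$. Find the probability density function $f_Y(y)$ for $Y$.
Hint: Consider three cases: $\frac{1}{5} \le y \le \frac{1}{3}$, $\frac{1}{3} \le y \le 1$, and $y \ge 1$.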

5. Let $X$ and $Y$ be discrete random variables. Show that $X$ and $Y$ are
independent if and only if $f_{X,Y}(x, y) = f_X(x) f_Y(y)$ for all $x$ and $y$.



6. Let $X$ have distribution $F$ and density function $f$ and let $A$ be a subset
of the real line. Let $I_A(x)$ be the indicator function for $A$:

$$I_A(x) = \begin{cases} 1 & x \in A \\ 0 & x \notin A. \end{cases}$$

Let $Y = I_A(X)$. Find an expression for the cumulative distribution of
$Y$. (Hint: first find the probability mass function for $Y$.)





7. Let $X$ and $Y$ be independent and suppose that each has a Uniform(0, 1)
distribution. Let $Z = \min\{X, Y\}$. Find the density $f_Z(z)$ for $Z$. Hint:
It might be easier to first find $\mathbb{P}(Z > z)$.

8. Let $X$ have CDF $F$. Find the CDF of $X^+ = \max\{0, X\}$.

9. Let $X \sim \text{Exp}(\beta)$. Find $F(x)$ and $F^{-1}(q)$.

10. Let X and Y be independent. Show that g{X) is independent of h{Y) 
where g and h are functions. 

11. Suppose we toss a coin once and let p be the probability of heads. Let 
X denote the number of heads and let Y denote the number of tails. 

(a) Prove that X and Y are dependent. 

(b) Let $N \sim \text{Poisson}(\lambda)$ and suppose we toss a coin $N$ times. Let $X$ and
$Y$ be the number of heads and tails. Show that $X$ and $Y$ are independent.

12. Prove Theorem 2.33. 

13. Let $X \sim N(0, 1)$ and let $Y = e^X$.

(a) Find the PDF for $Y$. Plot it.

(b) (Computer Experiment.) Generate a vector $x = (x_1, \ldots, x_{10,000})$ consisting
of 10,000 random standard Normals. Let $y = (y_1, \ldots, y_{10,000})$
where $y_i = e^{x_i}$. Draw a histogram of $y$ and compare it to the PDF you
found in part (a).

14. Let $(X, Y)$ be uniformly distributed on the unit disk $\{(x, y) : x^2 + y^2 \le 1\}$.
Let $R = \sqrt{X^2 + Y^2}$. Find the CDF and PDF of $R$.

15. (A universal random number generator.) Let $X$ have a continuous, strictly
increasing CDF $F$. Let $Y = F(X)$. Find the density of $Y$. This is called
the probability integral transform. Now let $U \sim \text{Uniform}(0, 1)$ and let
$X = F^{-1}(U)$. Show that $X \sim F$. Now write a program that takes
Uniform(0, 1) random variables and generates random variables from
an Exponential($\beta$) distribution.

16. Let $X \sim \text{Poisson}(\lambda)$ and $Y \sim \text{Poisson}(\mu)$ and assume that $X$ and $Y$ are
independent. Show that the distribution of $X$ given that $X + Y = n$ is
Binomial($n, \pi$) where $\pi = \lambda/(\lambda + \mu)$.

Hint 1: You may use the following fact: If $X \sim \text{Poisson}(\lambda)$ and $Y \sim \text{Poisson}(\mu)$,
and $X$ and $Y$ are independent, then $X + Y \sim \text{Poisson}(\mu + \lambda)$.

Hint 2: Note that $\{X = x, X + Y = n\} = \{X = x, Y = n - x\}$.



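17. Let

$$f_{X,Y}(x, y) = \begin{cases} c(x + y^2) & 0 \le x \le 1 \text{ and } 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Find $\mathbb{P}\left(X < \frac{1}{2} \mid Y = \frac{1}{2}\right)$.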



18. Let $X \sim N(3, 16)$. Solve the following using the Normal table and using
a computer package.

(a) Find $\mathbb{P}(X < 7)$.

(b) Find $\mathbb{P}(X > -2)$.

(c) Find $x$ such that $\mathbb{P}(X > x) = .05$.

(d) Find $\mathbb{P}(0 \le X < 4)$.

(e) Find $x$ such that $\mathbb{P}(|X| > |x|) = .05$.

19. Prove formula (2.12).

20. Let $X, Y \sim \text{Uniform}(0, 1)$ be independent. Find the PDF for $X - Y$ and
$X/Y$.

21. Let $X_1, \ldots, X_n \sim \text{Exp}(\beta)$ be IID. Let $Y = \max\{X_1, \ldots, X_n\}$. Find the
PDF of $Y$. Hint: $Y \le y$ if and only if $X_i \le y$ for $i = 1, \ldots, n$.




3 

Expectation 



3.1 Expectation of a Random Variable 

The mean, or expectation, of a random variable X is the average value of X. 

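3.1 Definition. The expected value, or mean, or first moment, of
$X$ is defined to be

$$\mathbb{E}(X) = \int x\,dF(x) = \begin{cases} \sum_x x f(x) & \text{if } X \text{ is discrete} \\ \int x f(x)\,dx & \text{if } X \text{ is continuous} \end{cases} \qquad (3.1)$$

assuming that the sum (or integral) is well defined. We use the following
notation to denote the expected value of $X$:

$$\mathbb{E}(X) = \mathbb{E}X = \int x\,dF(x) = \mu = \mu_X. \qquad (3.2)$$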

The expectation is a one-number summary of the distribution. Think of
$\mathbb{E}(X)$ as the average $\sum_{i=1}^n X_i / n$ of a large number of IID draws $X_1, \ldots, X_n$.
The fact that $\mathbb{E}(X) \approx \sum_{i=1}^n X_i / n$ is actually more than a heuristic; it is a
theorem called the law of large numbers that we will discuss in Chapter 5.

The notation $\int x\,dF(x)$ deserves some comment. We use it merely as a
convenient unifying notation so we don't have to write $\sum_x x f(x)$ for discrete
random variables and $\int x f(x)\,dx$ for continuous random variables, but you
should be aware that $\int x\,dF(x)$ has a precise meaning that is discussed in real
analysis courses.

To ensure that $\mathbb{E}(X)$ is well defined, we say that $\mathbb{E}(X)$ exists if $\int |x|\,dF_X(x) < \infty$.
Otherwise we say that the expectation does not exist.

3.2 Example. Let $X \sim \text{Bernoulli}(p)$. Then $\mathbb{E}(X) = \sum_{x=0}^{1} x f(x) = (0 \times (1 - p)) + (1 \times p) = p$. ■

3.3 Example. Flip a fair coin two times. Let $X$ be the number of heads. Then,
$\mathbb{E}(X) = \int x\,dF_X(x) = \sum_x x f_X(x) = (0 \times f(0)) + (1 \times f(1)) + (2 \times f(2)) =
(0 \times (1/4)) + (1 \times (1/2)) + (2 \times (1/4)) = 1$. ■

3.4 Example. Let $X \sim \text{Uniform}(-1, 3)$. Then, $\mathbb{E}(X) = \int x\,dF_X(x) = \int x f_X(x)\,dx =
\frac{1}{4} \int_{-1}^{3} x\,dx = 1$. ■



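3.5 Example. Recall that a random variable has a Cauchy distribution if it
has density $f_X(x) = \{\pi(1 + x^2)\}^{-1}$. Using integration by parts (set $u = x$
and $v = \tan^{-1} x$),

$$\int |x|\,dF(x) = \frac{2}{\pi} \int_0^\infty \frac{x\,dx}{1 + x^2} = \left[ x \tan^{-1}(x) \right]_0^\infty - \int_0^\infty \tan^{-1} x\,dx = \infty$$

so the mean does not exist. (The divergence can also be seen directly, since
$\int_0^\infty \frac{x\,dx}{1 + x^2} = \left[ \frac{1}{2} \log(1 + x^2) \right]_0^\infty = \infty$.)
If you simulate a Cauchy distribution many times
and take the average, you will see that the average never settles down. This
is because the Cauchy has thick tails and hence extreme observations are
common. ■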
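The remark about averages that never settle down can be seen in a short simulation. This is a minimal sketch under stated assumptions: Cauchy draws are generated with the inverse-CDF formula $\tan(\pi(U - 1/2))$ for $U \sim \text{Uniform}(0,1)$, and the helper names, seed, and sample sizes are chosen only for illustration.

```python
import math
import random

def cauchy_draw(rng):
    """One standard Cauchy draw via the inverse CDF: tan(pi * (U - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))

def running_means(draws):
    """Return the list of running averages of a sequence of numbers."""
    total, out = 0.0, []
    for i, x in enumerate(draws, start=1):
        total += x
        out.append(total / i)
    return out

if __name__ == "__main__":
    rng = random.Random(1)
    n = 100_000
    normal_means = running_means(rng.gauss(0, 1) for _ in range(n))
    cauchy_means = running_means(cauchy_draw(rng) for _ in range(n))
    # Print the running average at a few sample sizes for each distribution.
    for k in (100, 1_000, 10_000, 100_000):
        print(k, round(normal_means[k - 1], 4), round(cauchy_means[k - 1], 4))
```

The Normal column should shrink toward 0 as the sample size grows, while the Cauchy column typically keeps jumping whenever an extreme observation arrives.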



From now on, whenever we discuss expectations, we implicitly assume that 
they exist. 

Let $Y = r(X)$. How do we compute $\mathbb{E}(Y)$? One way is to find $f_Y(y)$ and
then compute $\mathbb{E}(Y) = \int y f_Y(y)\,dy$. But there is an easier way.

3.6 Theorem (The Rule of the Lazy Statistician). Let $Y = r(X)$. Then

$$\mathbb{E}(Y) = \mathbb{E}(r(X)) = \int r(x)\,dF_X(x). \qquad (3.3)$$



This result makes intuitive sense. Think of playing a game where we draw 
X at random and then I pay you Y = r(X). Your average income is r{x) times 
the chance that X = x, summed (or integrated) over all values of x. Here is 
a special case. Let $A$ be an event and let $r(x) = I_A(x)$ where $I_A(x) = 1$ if
$x \in A$ and $I_A(x) = 0$ if $x \notin A$. Then

$$\mathbb{E}(I_A(X)) = \int I_A(x) f_X(x)\,dx = \int_A f_X(x)\,dx = \mathbb{P}(X \in A).$$

In other words, probability is a special case of expectation. 



3.7 Example. Let $X \sim \text{Unif}(0, 1)$. Let $Y = r(X) = e^X$. Then,

$$\mathbb{E}(Y) = \int_0^1 e^x f(x)\,dx = \int_0^1 e^x\,dx = e - 1.$$

Alternatively, you could find $f_Y(y)$, which turns out to be $f_Y(y) = 1/y$ for
$1 < y < e$. Then, $\mathbb{E}(Y) = \int_1^e y f(y)\,dy = e - 1$. ■

3.8 Example. Take a stick of unit length and break it at random. Let $Y$ be
the length of the longer piece. What is the mean of $Y$? If $X$ is the break point
then $X \sim \text{Unif}(0, 1)$ and $Y = r(X) = \max\{X, 1 - X\}$. Thus, $r(x) = 1 - x$
when $0 < x < 1/2$ and $r(x) = x$ when $1/2 \le x < 1$. Hence,

$$\mathbb{E}(Y) = \int r(x)\,dF(x) = \int_0^{1/2} (1 - x)\,dx + \int_{1/2}^{1} x\,dx = \frac{3}{4}.$$ ■
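The rule of the lazy statistician also says how to estimate $\mathbb{E}(r(X))$ by simulation: draw $X$, apply $r$, and average, without ever deriving the density of $Y$. The sketch below applies this to the stick-breaking setting of Example 3.8; the function names and the seed are illustrative assumptions.

```python
import random

def longer_piece(x):
    """r(x) = max(x, 1 - x): length of the longer piece when the break is at x."""
    return max(x, 1 - x)

def mean_longer_piece(n_draws=200_000, seed=0):
    """Monte Carlo estimate of E(r(X)) with X ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    return sum(longer_piece(rng.random()) for _ in range(n_draws)) / n_draws

if __name__ == "__main__":
    # The estimate should be close to the exact answer 3/4 from Example 3.8.
    print(mean_longer_piece())
```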

Functions of several variables are handled in a similar way. If $Z = r(X, Y)$
then

$$\mathbb{E}(Z) = \mathbb{E}(r(X, Y)) = \iint r(x, y)\,dF(x, y). \qquad (3.4)$$

3.9 Example. Let $(X, Y)$ have a jointly uniform distribution on the unit
square. Let $Z = r(X, Y) = X^2 + Y^2$. Then,

$$\mathbb{E}(Z) = \iint r(x, y)\,dF(x, y) = \int_0^1 \int_0^1 (x^2 + y^2)\,dx\,dy = \int_0^1 x^2\,dx + \int_0^1 y^2\,dy = \frac{2}{3}.$$ ■

The $k$th moment of $X$ is defined to be $\mathbb{E}(X^k)$ assuming that $\mathbb{E}(|X|^k) < \infty$.

3.10 Theorem. If the $k$th moment exists and if $j < k$ then the $j$th moment
exists.

Proof. We have

$$\mathbb{E}|X|^j = \int_{-\infty}^{\infty} |x|^j f_X(x)\,dx = \int_{|x| \le 1} |x|^j f_X(x)\,dx + \int_{|x| > 1} |x|^j f_X(x)\,dx$$

$$\le \int_{|x| \le 1} f_X(x)\,dx + \int_{|x| > 1} |x|^k f_X(x)\,dx \le 1 + \mathbb{E}(|X|^k) < \infty.$$ ■

The $k$th central moment is defined to be $\mathbb{E}((X - \mu)^k)$.



3.2 Properties of Expectations 

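3.11 Theorem. If $X_1, \ldots, X_n$ are random variables and $a_1, \ldots, a_n$ are constants, then

$$\mathbb{E}\left( \sum_i a_i X_i \right) = \sum_i a_i \mathbb{E}(X_i). \qquad (3.5)$$

3.12 Example. Let $X \sim \text{Binomial}(n, p)$. What is the mean of $X$? We could
try to appeal to the definition:

$$\mathbb{E}(X) = \int x\,dF_X(x) = \sum_x x f_X(x) = \sum_{x=0}^{n} x \binom{n}{x} p^x (1 - p)^{n - x}$$

but this is not an easy sum to evaluate. Instead, note that $X = \sum_{i=1}^n X_i$
where $X_i = 1$ if the $i$th toss is heads and $X_i = 0$ otherwise. Then $\mathbb{E}(X_i) =
(p \times 1) + ((1 - p) \times 0) = p$ and $\mathbb{E}(X) = \mathbb{E}\left(\sum_i X_i\right) = \sum_i \mathbb{E}(X_i) = np$. ■

3.13 Theorem. Let $X_1, \ldots, X_n$ be independent random variables. Then,

$$\mathbb{E}\left( \prod_{i=1}^n X_i \right) = \prod_i \mathbb{E}(X_i). \qquad (3.6)$$

Notice that the summation rule does not require independence but the
multiplication rule does.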



3.3 Variance and Covariance 

The variance measures the "spread" of a distribution.¹

¹We can't use $\mathbb{E}(X - \mu)$ as a measure of spread since $\mathbb{E}(X - \mu) = \mathbb{E}(X) - \mu = \mu - \mu = 0$.
We can and sometimes do use $\mathbb{E}|X - \mu|$ as a measure of spread but more often we use the
variance.

3.14 Definition. Let $X$ be a random variable with mean $\mu$. The variance
of $X$, denoted by $\sigma^2$ or $\sigma_X^2$ or $\mathbb{V}(X)$ or $\mathbb{V}X$, is defined by

$$\sigma^2 = \mathbb{E}(X - \mu)^2 = \int (x - \mu)^2\,dF(x) \qquad (3.7)$$

assuming this expectation exists. The standard deviation is
$\text{sd}(X) = \sqrt{\mathbb{V}(X)}$ and is also denoted by $\sigma$ and $\sigma_X$.



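3.15 Theorem. Assuming the variance is well defined, it has the following
properties:

1. $\mathbb{V}(X) = \mathbb{E}(X^2) - \mu^2$.

2. If $a$ and $b$ are constants then $\mathbb{V}(aX + b) = a^2 \mathbb{V}(X)$.

3. If $X_1, \ldots, X_n$ are independent and $a_1, \ldots, a_n$ are constants, then

$$\mathbb{V}\left( \sum_{i=1}^n a_i X_i \right) = \sum_{i=1}^n a_i^2 \mathbb{V}(X_i). \qquad (3.8)$$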

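3.16 Example. Let $X \sim \text{Binomial}(n, p)$. We write $X = \sum_i X_i$ where $X_i = 1$
if toss $i$ is heads and $X_i = 0$ otherwise. Then $X = \sum_i X_i$ and the random
variables are independent. Also, $\mathbb{P}(X_i = 1) = p$ and $\mathbb{P}(X_i = 0) = 1 - p$. Recall
that

$$\mathbb{E}(X_i) = (p \times 1) + ((1 - p) \times 0) = p.$$

Now,

$$\mathbb{E}(X_i^2) = (p \times 1^2) + ((1 - p) \times 0^2) = p.$$

Therefore, $\mathbb{V}(X_i) = \mathbb{E}(X_i^2) - p^2 = p - p^2 = p(1 - p)$. Finally, $\mathbb{V}(X) =
\mathbb{V}\left(\sum_i X_i\right) = \sum_i \mathbb{V}(X_i) = \sum_i p(1 - p) = np(1 - p)$. Notice that $\mathbb{V}(X) = 0$
if $p = 1$ or $p = 0$. Make sure you see why this makes intuitive sense. ■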

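If $X_1, \ldots, X_n$ are random variables then we define the sample mean to be

$$\overline{X}_n = \frac{1}{n} \sum_{i=1}^n X_i \qquad (3.9)$$

and the sample variance to be

$$S_n^2 = \frac{1}{n - 1} \sum_{i=1}^n \left( X_i - \overline{X}_n \right)^2. \qquad (3.10)$$

3.17 Theorem. Let $X_1, \ldots, X_n$ be IID and let $\mu = \mathbb{E}(X_i)$, $\sigma^2 = \mathbb{V}(X_i)$. Then

$$\mathbb{E}(\overline{X}_n) = \mu, \quad \mathbb{V}(\overline{X}_n) = \frac{\sigma^2}{n} \quad \text{and} \quad \mathbb{E}(S_n^2) = \sigma^2.$$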
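Theorem 3.17 can be checked by simulation. The following sketch is an illustration rather than a proof: it draws many samples of size $n$ from Uniform(0, 1), for which $\mu = 1/2$ and $\sigma^2 = 1/12$, and compares the simulated mean and variance of $\overline{X}_n$ and the simulated mean of $S_n^2$ with the theorem. The function name, the choice of distribution, and the number of replications are assumptions made here.

```python
import random
import statistics

def simulate_sample_stats(n, reps=20_000, seed=0):
    """Draw `reps` samples of size n from Uniform(0,1); return the list of
    sample means and the list of sample variances (n - 1 denominator)."""
    rng = random.Random(seed)
    means, variances = [], []
    for _ in range(reps):
        xs = [rng.random() for _ in range(n)]
        means.append(statistics.fmean(xs))
        variances.append(statistics.variance(xs))
    return means, variances

if __name__ == "__main__":
    n = 10
    means, variances = simulate_sample_stats(n)
    print("mean of sample means:    ", statistics.fmean(means), "vs mu = 0.5")
    print("variance of sample means:", statistics.pvariance(means),
          "vs sigma^2/n =", 1 / (12 * n))
    print("mean of sample variances:", statistics.fmean(variances),
          "vs sigma^2 =", 1 / 12)
```

Dividing by $n - 1$ in `statistics.variance` matches definition (3.10); with an $n$ denominator the last comparison would come out slightly too small.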

If $X$ and $Y$ are random variables, then the covariance and correlation between
$X$ and $Y$ measure how strong the linear relationship is between $X$ and $Y$.

3.18 Definition. Let $X$ and $Y$ be random variables with means $\mu_X$ and
$\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$. Define the covariance between
$X$ and $Y$ by

$$\text{Cov}(X, Y) = \mathbb{E}\left( (X - \mu_X)(Y - \mu_Y) \right) \qquad (3.11)$$

and the correlation by

$$\rho = \rho_{X,Y} = \rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}. \qquad (3.12)$$

3.19 Theorem. The covariance satisfies:

$$\text{Cov}(X, Y) = \mathbb{E}(XY) - \mathbb{E}(X)\mathbb{E}(Y).$$

The correlation satisfies:

$$-1 \le \rho(X, Y) \le 1.$$

If $Y = aX + b$ for some constants $a$ and $b$ then $\rho(X, Y) = 1$ if $a > 0$ and
$\rho(X, Y) = -1$ if $a < 0$. If $X$ and $Y$ are independent, then $\text{Cov}(X, Y) = \rho = 0$.
The converse is not true in general.

3.20 Theorem. $\mathbb{V}(X + Y) = \mathbb{V}(X) + \mathbb{V}(Y) + 2\,\text{Cov}(X, Y)$ and $\mathbb{V}(X - Y) =
\mathbb{V}(X) + \mathbb{V}(Y) - 2\,\text{Cov}(X, Y)$. More generally, for random variables $X_1, \ldots, X_n$,

$$\mathbb{V}\left( \sum_i a_i X_i \right) = \sum_i a_i^2 \mathbb{V}(X_i) + 2 \sum\sum_{i < j} a_i a_j \text{Cov}(X_i, X_j).$$

3.4 Expectation and Variance of Important Random Variables

Here we record the expectation and variance of some important random variables:

Distribution                          Mean                       Variance
Point mass at $a$                     $a$                        $0$
Bernoulli($p$)                        $p$                        $p(1 - p)$
Binomial($n, p$)                      $np$                       $np(1 - p)$
Geometric($p$)                        $1/p$                      $(1 - p)/p^2$
Poisson($\lambda$)                    $\lambda$                  $\lambda$
Uniform($a, b$)                       $(a + b)/2$                $(b - a)^2/12$
Normal($\mu, \sigma^2$)               $\mu$                      $\sigma^2$
Exponential($\beta$)                  $\beta$                    $\beta^2$
Gamma($\alpha, \beta$)                $\alpha\beta$              $\alpha\beta^2$
Beta($\alpha, \beta$)                 $\alpha/(\alpha + \beta)$  $\alpha\beta/((\alpha + \beta)^2(\alpha + \beta + 1))$
$t_\nu$                               $0$ (if $\nu > 1$)         $\nu/(\nu - 2)$ (if $\nu > 2$)
$\chi^2_p$                            $p$                        $2p$
Multinomial($n, p$)                   $np$                       see below
Multivariate Normal($\mu, \Sigma$)    $\mu$                      $\Sigma$



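We derived $\mathbb{E}(X)$ and $\mathbb{V}(X)$ for the Binomial in the previous section. The
calculations for some of the others are in the exercises.

The last two entries in the table are multivariate models which involve a
random vector $X$ of the form

$$X = \begin{pmatrix} X_1 \\ \vdots \\ X_k \end{pmatrix}.$$

The mean of a random vector $X$ is defined by

$$\mu = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_k \end{pmatrix} = \begin{pmatrix} \mathbb{E}(X_1) \\ \vdots \\ \mathbb{E}(X_k) \end{pmatrix}.$$

The variance-covariance matrix $\Sigma$ is defined to be

$$\mathbb{V}(X) = \begin{pmatrix} \mathbb{V}(X_1) & \text{Cov}(X_1, X_2) & \cdots & \text{Cov}(X_1, X_k) \\ \text{Cov}(X_2, X_1) & \mathbb{V}(X_2) & \cdots & \text{Cov}(X_2, X_k) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}(X_k, X_1) & \text{Cov}(X_k, X_2) & \cdots & \mathbb{V}(X_k) \end{pmatrix}.$$

If $X \sim \text{Multinomial}(n, p)$ then $\mathbb{E}(X) = np = n(p_1, \ldots, p_k)$ and

$$\mathbb{V}(X) = \begin{pmatrix} np_1(1 - p_1) & -np_1 p_2 & \cdots & -np_1 p_k \\ -np_2 p_1 & np_2(1 - p_2) & \cdots & -np_2 p_k \\ \vdots & \vdots & \ddots & \vdots \\ -np_k p_1 & -np_k p_2 & \cdots & np_k(1 - p_k) \end{pmatrix}.$$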
To see this, note that the marginal distribution of any one component of the
vector is $X_i \sim \text{Binomial}(n, p_i)$. Thus, $\mathbb{E}(X_i) = np_i$ and $\mathbb{V}(X_i) = np_i(1 - p_i)$.
Note also that $X_i + X_j \sim \text{Binomial}(n, p_i + p_j)$. Thus, $\mathbb{V}(X_i + X_j) = n(p_i +
p_j)(1 - [p_i + p_j])$. On the other hand, using the formula for the variance
of a sum, we have that $\mathbb{V}(X_i + X_j) = \mathbb{V}(X_i) + \mathbb{V}(X_j) + 2\,\text{Cov}(X_i, X_j) =
np_i(1 - p_i) + np_j(1 - p_j) + 2\,\text{Cov}(X_i, X_j)$. If we equate this formula with
$n(p_i + p_j)(1 - [p_i + p_j])$ and solve, we get $\text{Cov}(X_i, X_j) = -np_i p_j$.

Finally, here is a lemma that can be useful for finding means and variances
of linear combinations of multivariate random vectors.

3.21 Lemma. If $a$ is a vector and $X$ is a random vector with mean $\mu$ and
variance $\Sigma$, then $\mathbb{E}(a^T X) = a^T \mu$ and $\mathbb{V}(a^T X) = a^T \Sigma a$. If $A$ is a matrix then
$\mathbb{E}(AX) = A\mu$ and $\mathbb{V}(AX) = A \Sigma A^T$.
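The Multinomial covariance $\text{Cov}(X_i, X_j) = -np_i p_j$ derived above can be compared with a simulation. The sketch below is a rough check under assumed settings ($n = 20$ and $p = (0.2, 0.3, 0.5)$, chosen only for illustration); `multinomial_draw` and `sample_cov` are hypothetical helper names, not functions from any library.

```python
import random

def multinomial_draw(n, probs, rng):
    """One Multinomial(n, probs) draw: category counts from n independent trials."""
    counts = [0] * len(probs)
    for j in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[j] += 1
    return counts

def sample_cov(us, vs):
    """Sample covariance of two equal-length sequences (n - 1 denominator)."""
    mu, mv = sum(us) / len(us), sum(vs) / len(vs)
    return sum((u - mu) * (v - mv) for u, v in zip(us, vs)) / (len(us) - 1)

if __name__ == "__main__":
    n, probs = 20, [0.2, 0.3, 0.5]
    rng = random.Random(0)
    draws = [multinomial_draw(n, probs, rng) for _ in range(50_000)]
    x1 = [d[0] for d in draws]
    x2 = [d[1] for d in draws]
    print("Cov(X1, X2):", sample_cov(x1, x2), "vs", -n * probs[0] * probs[1])
    print("V(X1):      ", sample_cov(x1, x1), "vs", n * probs[0] * (1 - probs[0]))
```

The negative covariance reflects the constraint that the counts sum to $n$: a large count in one category leaves fewer trials for the others.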

3.5 Conditional Expectation 

Suppose that $X$ and $Y$ are random variables. What is the mean of $X$ among
those times when $Y = y$? The answer is that we compute the mean of $X$ as
before but we substitute $f_{X|Y}(x|y)$ for $f_X(x)$ in the definition of expectation.

3.22 Definition. The conditional expectation of $X$ given $Y = y$ is

$$\mathbb{E}(X \mid Y = y) = \begin{cases} \sum_x x f_{X|Y}(x|y) & \text{discrete case} \\ \int x f_{X|Y}(x|y)\,dx & \text{continuous case.} \end{cases}$$

If $r(x, y)$ is a function of $x$ and $y$ then

$$\mathbb{E}(r(X, Y) \mid Y = y) = \begin{cases} \sum_x r(x, y) f_{X|Y}(x|y) & \text{discrete case} \\ \int r(x, y) f_{X|Y}(x|y)\,dx & \text{continuous case.} \end{cases}$$

Warning! Here is a subtle point. Whereas $\mathbb{E}(X)$ is a number, $\mathbb{E}(X \mid Y = y)$
is a function of $y$. Before we observe $Y$, we don't know the value of $\mathbb{E}(X \mid Y = y)$
so it is a random variable which we denote $\mathbb{E}(X \mid Y)$. In other words, $\mathbb{E}(X \mid Y)$
is the random variable whose value is $\mathbb{E}(X \mid Y = y)$ when $Y = y$. Similarly,
$\mathbb{E}(r(X, Y) \mid Y)$ is the random variable whose value is $\mathbb{E}(r(X, Y) \mid Y = y)$ when
$Y = y$. This is a very confusing point so let us look at an example.

3.23 Example. Suppose we draw $X \sim \text{Unif}(0, 1)$. After we observe $X = x$,
we draw $Y \mid X = x \sim \text{Unif}(x, 1)$. Intuitively, we expect that $\mathbb{E}(Y \mid X = x) =
(1 + x)/2$. In fact, $f_{Y|X}(y|x) = 1/(1 - x)$ for $x < y < 1$ and

$$\mathbb{E}(Y \mid X = x) = \int_x^1 y f_{Y|X}(y|x)\,dy = \frac{1}{1 - x} \int_x^1 y\,dy = \frac{1 + x}{2}$$

as expected. Thus, $\mathbb{E}(Y \mid X) = (1 + X)/2$. Notice that $\mathbb{E}(Y \mid X) = (1 + X)/2$ is
a random variable whose value is the number $\mathbb{E}(Y \mid X = x) = (1 + x)/2$ once
$X = x$ is observed. ■

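3.24 Theorem (The Rule of Iterated Expectations). For random variables $X$
and $Y$, assuming the expectations exist, we have that

$$\mathbb{E}[\mathbb{E}(Y \mid X)] = \mathbb{E}(Y) \quad \text{and} \quad \mathbb{E}[\mathbb{E}(X \mid Y)] = \mathbb{E}(X). \qquad (3.15)$$

More generally, for any function $r(x, y)$ we have

$$\mathbb{E}[\mathbb{E}(r(X, Y) \mid X)] = \mathbb{E}(r(X, Y)). \qquad (3.16)$$

Proof. We'll prove the first equation. Using the definition of conditional
expectation and the fact that $f(x, y) = f(x) f(y|x)$,

$$\mathbb{E}[\mathbb{E}(Y \mid X)] = \int \mathbb{E}(Y \mid X = x) f_X(x)\,dx = \iint y f(y|x)\,dy\,f(x)\,dx$$

$$= \iint y f(y|x) f(x)\,dx\,dy = \iint y f(x, y)\,dx\,dy = \mathbb{E}(Y).$$ ■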



3.25 Example. Consider Example 3.23. How can we compute $\mathbb{E}(Y)$? One
method is to find the joint density $f(x, y)$ and then compute $\mathbb{E}(Y) = \iint y f(x, y)\,dx\,dy$.
An easier way is to do this in two steps. First, we already know that $\mathbb{E}(Y \mid X) =
(1 + X)/2$. Thus,

$$\mathbb{E}(Y) = \mathbb{E}\,\mathbb{E}(Y \mid X) = \mathbb{E}\left( \frac{1 + X}{2} \right) = \frac{1 + \mathbb{E}(X)}{2} = \frac{1 + (1/2)}{2} = 3/4.$$ ■

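3.26 Definition. The conditional variance is defined as

$$\mathbb{V}(Y \mid X = x) = \int (y - \mu(x))^2 f(y|x)\,dy \qquad (3.17)$$

where $\mu(x) = \mathbb{E}(Y \mid X = x)$.

3.27 Theorem. For random variables $X$ and $Y$,

$$\mathbb{V}(Y) = \mathbb{E}\,\mathbb{V}(Y \mid X) + \mathbb{V}\,\mathbb{E}(Y \mid X).$$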
3.28 Example. Draw a county at random from the United States. Then draw
$n$ people at random from the county. Let $X$ be the number of those people
who have a certain disease. If $Q$ denotes the proportion of people in that
county with the disease, then $Q$ is also a random variable since it varies from
county to county. Given $Q = q$, we have that $X \sim \text{Binomial}(n, q)$. Thus,
$\mathbb{E}(X \mid Q = q) = nq$ and $\mathbb{V}(X \mid Q = q) = nq(1 - q)$. Suppose that the random
variable $Q$ has a Uniform(0, 1) distribution. A distribution that is constructed
in stages like this is called a hierarchical model and can be written as

$$Q \sim \text{Uniform}(0, 1)$$

$$X \mid Q = q \sim \text{Binomial}(n, q).$$

Now, $\mathbb{E}(X) = \mathbb{E}\,\mathbb{E}(X \mid Q) = \mathbb{E}(nQ) = n\mathbb{E}(Q) = n/2$. Let us compute the
variance of $X$. Now, $\mathbb{V}(X) = \mathbb{E}\,\mathbb{V}(X \mid Q) + \mathbb{V}\,\mathbb{E}(X \mid Q)$. Let's compute these
two terms. First, $\mathbb{E}\,\mathbb{V}(X \mid Q) = \mathbb{E}[nQ(1 - Q)] = n\mathbb{E}(Q(1 - Q)) = n \int q(1 - q) f(q)\,dq
= n \int_0^1 q(1 - q)\,dq = n/6$. Next, $\mathbb{V}\,\mathbb{E}(X \mid Q) = \mathbb{V}(nQ) = n^2 \mathbb{V}(Q) =
n^2 \int_0^1 (q - (1/2))^2\,dq = n^2/12$. Hence, $\mathbb{V}(X) = (n/6) + (n^2/12)$. ■
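A hierarchical model is simulated in the same stages in which it is written down. The sketch below draws $Q$ first and then $X$ given $Q$, and compares the simulated mean and variance of $X$ with $n/2$ and $n/6 + n^2/12$ from Example 3.28. The value $n = 12$, the helper name `draw_x`, and the seed are assumptions for illustration only.

```python
import random
import statistics

def draw_x(n, rng):
    """Draw Q ~ Uniform(0,1), then X | Q = q ~ Binomial(n, q) as a sum of n coin flips."""
    q = rng.random()
    return sum(rng.random() < q for _ in range(n))

if __name__ == "__main__":
    n = 12
    rng = random.Random(0)
    xs = [draw_x(n, rng) for _ in range(100_000)]
    print("mean:", statistics.fmean(xs), "vs n/2 =", n / 2)
    print("var: ", statistics.pvariance(xs), "vs n/6 + n^2/12 =", n / 6 + n ** 2 / 12)
```

Note how much larger the variance is than the $np(1 - p)$ of a single Binomial: the $n^2/12$ term comes entirely from the county-to-county variation in $Q$.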



3.6 Moment Generating Functions 



Now we will define the moment generating function, which is used for finding
moments, for finding the distribution of sums of random variables, and which
is also used in the proofs of some theorems.

3.29 Definition. The moment generating function (MGF), or Laplace
transform, of $X$ is defined by

$$\psi_X(t) = \mathbb{E}(e^{tX}) = \int e^{tx}\,dF(x)$$

where $t$ varies over the real numbers.

In what follows, we assume that the MGF is well defined for all $t$ in some
open interval around $t = 0$.²

When the MGF is well defined, it can be shown that we can interchange the
operations of differentiation and "taking expectation." This leads to

$$\psi'(0) = \left[ \frac{d}{dt} \mathbb{E} e^{tX} \right]_{t=0} = \mathbb{E}\left[ \frac{d}{dt} e^{tX} \right]_{t=0} = \mathbb{E}\left[ X e^{tX} \right]_{t=0} = \mathbb{E}(X).$$

²A related function is the characteristic function, defined by $\mathbb{E}(e^{itX})$ where $i = \sqrt{-1}$. This
function is always well defined for all $t$.
By taking $k$ derivatives we conclude that $\psi^{(k)}(0) = \mathbb{E}(X^k)$. This gives us a
method for computing the moments of a distribution.

3.30 Example. Let $X \sim \text{Exp}(1)$. For any $t < 1$,

$$\psi_X(t) = \mathbb{E} e^{tX} = \int_0^\infty e^{tx} e^{-x}\,dx = \int_0^\infty e^{(t - 1)x}\,dx = \frac{1}{1 - t}.$$

The integral is divergent if $t \ge 1$. So, $\psi_X(t) = 1/(1 - t)$ for all $t < 1$. Now,
$\psi'(0) = 1$ and $\psi''(0) = 2$. Hence, $\mathbb{E}(X) = 1$ and $\mathbb{V}(X) = \mathbb{E}(X^2) - \mu^2 = 2 - 1 = 1$. ■
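The relation $\psi^{(k)}(0) = \mathbb{E}(X^k)$ can be illustrated numerically. The sketch below approximates $\psi'(0)$ and $\psi''(0)$ for the Exp(1) MGF of Example 3.30 with central finite differences. Finite differencing is a device of this sketch, not a method from the text, and the step size `h` and function names are assumptions.

```python
def mgf_exp1(t):
    """MGF of Exp(1) from Example 3.30: psi(t) = 1 / (1 - t), valid for t < 1."""
    if t >= 1:
        raise ValueError("the MGF of Exp(1) is only finite for t < 1")
    return 1.0 / (1.0 - t)

def first_two_moments(mgf, h=1e-4):
    """Approximate psi'(0) and psi''(0) by central finite differences."""
    d1 = (mgf(h) - mgf(-h)) / (2 * h)
    d2 = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / (h * h)
    return d1, d2

if __name__ == "__main__":
    m1, m2 = first_two_moments(mgf_exp1)
    print("E(X)   ~", m1)            # close to 1
    print("E(X^2) ~", m2)            # close to 2
    print("V(X)   ~", m2 - m1 ** 2)  # close to 1
```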



3.31 Lemma. Properties of the MGF.

(1) If $Y = aX + b$, then $\psi_Y(t) = e^{bt} \psi_X(at)$.

(2) If $X_1, \ldots, X_n$ are independent and $Y = \sum_i X_i$, then $\psi_Y(t) = \prod_i \psi_i(t)$
where $\psi_i$ is the MGF of $X_i$.

3.32 Example. Let $X \sim \text{Binomial}(n, p)$. We know that $X = \sum_{i=1}^n X_i$ where
$\mathbb{P}(X_i = 1) = p$ and $\mathbb{P}(X_i = 0) = 1 - p$. Now $\psi_i(t) = \mathbb{E} e^{X_i t} = (p \times e^t) + ((1 - p) \times 1)
= pe^t + q$ where $q = 1 - p$. Thus, $\psi_X(t) = \prod_i \psi_i(t) = (pe^t + q)^n$. ■

Recall that $X$ and $Y$ are equal in distribution if they have the same distribution
function and we write $X \overset{d}{=} Y$.

3.33 Theorem. Let $X$ and $Y$ be random variables. If $\psi_X(t) = \psi_Y(t)$ for all $t$
in an open interval around 0, then $X \overset{d}{=} Y$.

3.34 Example. Let $X_1 \sim \text{Binomial}(n_1, p)$ and $X_2 \sim \text{Binomial}(n_2, p)$ be independent.
Let $Y = X_1 + X_2$. Then,

$$\psi_Y(t) = \psi_1(t) \psi_2(t) = (pe^t + q)^{n_1} (pe^t + q)^{n_2} = (pe^t + q)^{n_1 + n_2}$$

and we recognize the latter as the MGF of a Binomial($n_1 + n_2, p$) distribution.
Since the MGF characterizes the distribution (i.e., there can't be another
random variable which has the same MGF) we conclude that $Y \sim
\text{Binomial}(n_1 + n_2, p)$. ■
Moment Generating Functions for Some Common Distributions

Distribution                 MGF $\psi(t)$
Bernoulli($p$)               $pe^t + (1 - p)$
Binomial($n, p$)             $(pe^t + (1 - p))^n$
Poisson($\lambda$)           $e^{\lambda(e^t - 1)}$
Normal($\mu, \sigma^2$)      $\exp\left\{ \mu t + \frac{\sigma^2 t^2}{2} \right\}$
Gamma($\alpha, \beta$)       $\left( \frac{1}{1 - \beta t} \right)^{\alpha}$ for $t < 1/\beta$



3.35 Example. Let $Y_1 \sim \text{Poisson}(\lambda_1)$ and $Y_2 \sim \text{Poisson}(\lambda_2)$ be independent.
The moment generating function of $Y = Y_1 + Y_2$ is $\psi_Y(t) = \psi_{Y_1}(t) \psi_{Y_2}(t) =
e^{\lambda_1(e^t - 1)} e^{\lambda_2(e^t - 1)} = e^{(\lambda_1 + \lambda_2)(e^t - 1)}$, which is the moment generating function
of a Poisson($\lambda_1 + \lambda_2$). We have thus proved that the sum of two independent
Poisson random variables has a Poisson distribution. ■
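The conclusion of Example 3.35 can also be confirmed without MGFs by convolving the two probability mass functions directly. The sketch below does this for the assumed rates $\lambda_1 = 1.5$ and $\lambda_2 = 2.5$; the rates and the function names are illustrative choices, not values from the text.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def pmf_of_sum(k, lam1, lam2):
    """P(Y1 + Y2 = k) by direct convolution of two independent Poisson pmfs."""
    return sum(poisson_pmf(j, lam1) * poisson_pmf(k - j, lam2) for j in range(k + 1))

if __name__ == "__main__":
    lam1, lam2 = 1.5, 2.5
    # The two columns should agree up to floating-point rounding.
    for k in range(6):
        print(k, pmf_of_sum(k, lam1, lam2), poisson_pmf(k, lam1 + lam2))
```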



3.7 Appendix 

Expectation as an Integral. The integral of a measurable function $r(x)$
is defined as follows. First suppose that $r$ is simple, meaning that it takes
finitely many values $a_1, \ldots, a_k$ over a partition $A_1, \ldots, A_k$. Then define

$$\int r(x)\,dF(x) = \sum_{i=1}^{k} a_i\,\mathbb{P}(X \in A_i).$$

The integral of a positive measurable function $r$ is defined by $\int r(x)\,dF(x) =
\lim_i \int r_i(x)\,dF(x)$ where $r_i$ is a sequence of simple functions such that $r_i(x) \le
r(x)$ and $r_i(x) \to r(x)$ as $i \to \infty$. This does not depend on the particular sequence.
The integral of a measurable function $r$ is defined to be $\int r(x)\,dF(x) =
\int r^+(x)\,dF(x) - \int r^-(x)\,dF(x)$ assuming both integrals are finite, where $r^+(x) =
\max\{r(x), 0\}$ and $r^-(x) = -\min\{r(x), 0\}$.



3.8 Exercises 

1. Suppose we play a game where we start with c dollars. On each play of 
the game you either double or halve your money, with equal probability. 
What is your expected fortune after n trials? 
2. Show that $\mathbb{V}(X) = 0$ if and only if there is a constant $c$ such that
$\mathbb{P}(X = c) = 1$.

3. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, 1)$ and let $Y_n = \max\{X_1, \ldots, X_n\}$. Find
$\mathbb{E}(Y_n)$.

4. A particle starts at the origin of the real line and moves along the line in
jumps of one unit. For each jump the probability is $p$ that the particle
will jump one unit to the left and the probability is $1 - p$ that the particle
will jump one unit to the right. Let $X_n$ be the position of the particle
after $n$ units. Find $\mathbb{E}(X_n)$ and $\mathbb{V}(X_n)$. (This is known as a random
walk.)

5. A fair coin is tossed until a head is obtained. What is the expected 
number of tosses that will be required? 

6. Prove Theorem 3.6 for discrete random variables. 

7. Let $X$ be a continuous random variable with CDF $F$. Suppose that
$\mathbb{P}(X > 0) = 1$ and that $\mathbb{E}(X)$ exists. Show that $\mathbb{E}(X) = \int_0^\infty \mathbb{P}(X > x)\,dx$.

Hint: Consider integrating by parts. The following fact is helpful: if $\mathbb{E}(X)$
exists then $\lim_{x \to \infty} x[1 - F(x)] = 0$.

8. Prove Theorem 3.17.

9. (Computer Experiment.) Let $X_1, X_2, \ldots, X_n$ be $N(0, 1)$ random variables
and let $\overline{X}_n = n^{-1} \sum_{i=1}^n X_i$. Plot $\overline{X}_n$ versus $n$ for $n = 1, \ldots, 10,000$.
Repeat for $X_1, X_2, \ldots, X_n \sim \text{Cauchy}$. Explain why there is such a difference.

10. Let $X \sim N(0, 1)$ and let $Y = e^X$. Find $\mathbb{E}(Y)$ and $\mathbb{V}(Y)$.

11. (Computer Experiment: Simulating the Stock Market.) Let $Y_1, Y_2, \ldots$ be
independent random variables such that $\mathbb{P}(Y_i = 1) = \mathbb{P}(Y_i = -1) =
1/2$. Let $X_n = \sum_{i=1}^n Y_i$. Think of $Y_i = 1$ as "the stock price increased
by one dollar", $Y_i = -1$ as "the stock price decreased by one dollar",
and $X_n$ as the value of the stock on day $n$.

(a) Find $\mathbb{E}(X_n)$ and $\mathbb{V}(X_n)$.

(b) Simulate $X_n$ and plot $X_n$ versus $n$ for $n = 1, 2, \ldots, 10,000$. Repeat
the whole simulation several times. Notice two things. First, it's easy
to "see" patterns in the sequence even though it is random. Second,
you will find that the four runs look very different even though they
were generated the same way. How do the calculations in (a) explain
the second observation?

12. Prove the formulas given in the table at the beginning of Section 3.4
for the Bernoulli, Poisson, Uniform, Exponential, Gamma, and Beta.
Here are some hints. For the mean of the Poisson, use the fact that
$e^a = \sum_{x=0}^{\infty} a^x / x!$. To compute the variance, first compute $\mathbb{E}(X(X - 1))$.
For the mean of the Gamma, it will help to multiply and divide by
$\Gamma(\alpha + 1)/\beta^{\alpha + 1}$ and use the fact that a Gamma density integrates to 1.
For the Beta, multiply and divide by $\Gamma(\alpha + 1)\Gamma(\beta)/\Gamma(\alpha + \beta + 1)$.

13. Suppose we generate a random variable X in the following way. First 
we flip a fair coin. If the coin is heads, take X to have a Unif(0,l) 
distribution. If the coin is tails, take X to have a Unif(3,4) distribution. 

(a) Find the mean of X. 

(b) Find the standard deviation of X. 

14. Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be random variables and let $a_1, \ldots, a_m$
and $b_1, \ldots, b_n$ be constants. Show that

$$\text{Cov}\left( \sum_{i=1}^m a_i X_i, \sum_{j=1}^n b_j Y_j \right) = \sum_{i=1}^m \sum_{j=1}^n a_i b_j \text{Cov}(X_i, Y_j).$$

15. Let

$$f_{X,Y}(x, y) = \begin{cases} \frac{1}{3}(x + y) & 0 \le x \le 1,\ 0 \le y \le 2 \\ 0 & \text{otherwise.} \end{cases}$$

Find $\mathbb{V}(2X - 3Y + 8)$.

16. Let $r(x)$ be a function of $x$ and let $s(y)$ be a function of $y$. Show that

$$\mathbb{E}(r(X) s(Y) \mid X) = r(X)\,\mathbb{E}(s(Y) \mid X).$$

Also, show that $\mathbb{E}(r(X) \mid X) = r(X)$.

17. Prove that

$$\mathbb{V}(Y) = \mathbb{E}\,\mathbb{V}(Y \mid X) + \mathbb{V}\,\mathbb{E}(Y \mid X).$$

Hint: Let $m = \mathbb{E}(Y)$ and let $b(x) = \mathbb{E}(Y \mid X = x)$. Note that $\mathbb{E}(b(X)) =
\mathbb{E}\,\mathbb{E}(Y \mid X) = \mathbb{E}(Y) = m$. Bear in mind that $b$ is a function of $x$. Now
write $\mathbb{V}(Y) = \mathbb{E}(Y - m)^2 = \mathbb{E}((Y - b(X)) + (b(X) - m))^2$. Expand the
square and take the expectation. You then have to take the expectation
of three terms. In each case, use the rule of the iterated expectation:
$\mathbb{E}(\text{stuff}) = \mathbb{E}(\mathbb{E}(\text{stuff} \mid X))$.

18. Show that if E(X|Y = y) = c for some constant c, then X and Y are 
uncorrelated. 

19. This question is to help you understand the idea of a sampling distribution.
Let $X_1, \ldots, X_n$ be IID with mean $\mu$ and variance $\sigma^2$. Let
$\overline{X}_n = n^{-1} \sum_{i=1}^n X_i$. Then $\overline{X}_n$ is a statistic, that is, a function of the
data. Since $\overline{X}_n$ is a random variable, it has a distribution. This distribution
is called the sampling distribution of the statistic. Recall from
Theorem 3.17 that $\mathbb{E}(\overline{X}_n) = \mu$ and $\mathbb{V}(\overline{X}_n) = \sigma^2/n$. Don't confuse the
distribution of the data $f_X$ and the distribution of the statistic $f_{\overline{X}_n}$. To
make this clear, let $X_1, \ldots, X_n \sim \text{Uniform}(0, 1)$. Let $f_X$ be the density
of the Uniform(0, 1). Plot $f_X$. Now let $\overline{X}_n = n^{-1} \sum_{i=1}^n X_i$. Find $\mathbb{E}(\overline{X}_n)$
and $\mathbb{V}(\overline{X}_n)$. Plot them as a function of $n$. Interpret. Now simulate the
distribution of $\overline{X}_n$ for $n = 1, 5, 25, 100$. Check that the simulated values
of $\mathbb{E}(\overline{X}_n)$ and $\mathbb{V}(\overline{X}_n)$ agree with your theoretical calculations. What do
you notice about the sampling distribution of $\overline{X}_n$ as $n$ increases?

20. Prove Lemma 3.21. 

21. Let X and Y be random variables. Suppose that E(Y|X) = X. Show 
that Cov(X,Y) = V(X). 

22. Let $X \sim \text{Uniform}(0, 1)$. Let $0 < a < b < 1$. Let

$$Y = \begin{cases} 1 & 0 < x < b \\ 0 & \text{otherwise} \end{cases}$$

and let

$$Z = \begin{cases} 1 & a < x < 1 \\ 0 & \text{otherwise.} \end{cases}$$

(a) Are $Y$ and $Z$ independent? Why/Why not?

(b) Find $\mathbb{E}(Y \mid Z)$. Hint: What values $z$ can $Z$ take? Now find $\mathbb{E}(Y \mid Z = z)$.

23. Find the moment generating function for the Poisson, Normal, and
Gamma distributions.

24. Let $X_1, \ldots, X_n \sim \text{Exp}(\beta)$. Find the moment generating function of $X_i$.
Prove that $\sum_{i=1}^n X_i \sim \text{Gamma}(n, \beta)$.




4 

Inequalities 



4.1 Probability Inequalities 

Inequalities are useful for bounding quantities that might otherwise be hard
to compute. They will also be used in the theory of convergence which is
discussed in the next chapter. Our first inequality is Markov's inequality.

4.1 Theorem (Markov's inequality). Let $X$ be a non-negative random
variable and suppose that $\mathbb{E}(X)$ exists. For any $t > 0$,

$$\mathbb{P}(X > t) \le \frac{\mathbb{E}(X)}{t}. \qquad (4.1)$$

Proof. Since $X \ge 0$,

$$\mathbb{E}(X) = \int_0^\infty x f(x)\,dx = \int_0^t x f(x)\,dx + \int_t^\infty x f(x)\,dx \ge \int_t^\infty x f(x)\,dx \ge t \int_t^\infty f(x)\,dx = t\,\mathbb{P}(X > t).$$ ■
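4.2 Theorem (Chebyshev's inequality). Let $\mu = \mathbb{E}(X)$ and $\sigma^2 = \mathbb{V}(X)$.
Then,

$$\mathbb{P}(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2} \quad \text{and} \quad \mathbb{P}(|Z| \ge k) \le \frac{1}{k^2} \qquad (4.2)$$

where $Z = (X - \mu)/\sigma$. In particular, $\mathbb{P}(|Z| > 2) \le 1/4$ and
$\mathbb{P}(|Z| > 3) \le 1/9$.

Proof. We use Markov's inequality to conclude that

$$\mathbb{P}(|X - \mu| \ge t) = \mathbb{P}(|X - \mu|^2 \ge t^2) \le \frac{\mathbb{E}(X - \mu)^2}{t^2} = \frac{\sigma^2}{t^2}.$$

The second part follows by setting $t = k\sigma$. ■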



4.3 Example. Suppose we test a prediction method, a neural net for example,
on a set of $n$ new test cases. Let $X_i = 1$ if the predictor is wrong and $X_i = 0$
if the predictor is right. Then $\overline{X}_n = n^{-1} \sum_{i=1}^n X_i$ is the observed error rate.
Each $X_i$ may be regarded as a Bernoulli with unknown mean $p$. We would
like to know the true, but unknown, error rate $p$. Intuitively, we expect
that $\overline{X}_n$ should be close to $p$. How likely is $\overline{X}_n$ to not be within $\epsilon$ of $p$? We
have that $\mathbb{V}(\overline{X}_n) = \mathbb{V}(X_1)/n = p(1 - p)/n$ and

$$\mathbb{P}(|\overline{X}_n - p| > \epsilon) \le \frac{\mathbb{V}(\overline{X}_n)}{\epsilon^2} = \frac{p(1 - p)}{n\epsilon^2} \le \frac{1}{4n\epsilon^2}$$

since $p(1 - p) \le \frac{1}{4}$ for all $p$. For $\epsilon = .2$ and $n = 100$ the bound is .0625. ■



Hoeffding's inequality is similar in spirit to Markov's inequality but it is a
sharper inequality. We present the result here in two parts.

4.4 Theorem (Hoeffding's Inequality). Let $Y_1, \ldots, Y_n$ be independent
observations such that
$\mathbb{E}(Y_i) = 0$ and $a_i \le Y_i \le b_i$. Let $\epsilon > 0$. Then, for any $t > 0$,

$$\mathbb{P}\left( \sum_{i=1}^n Y_i \ge \epsilon \right) \le e^{-t\epsilon} \prod_{i=1}^n e^{t^2 (b_i - a_i)^2 / 8}. \qquad (4.3)$$

4.5 Theorem. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. Then, for any $\epsilon > 0$,

$$\mathbb{P}(|\overline{X}_n - p| > \epsilon) \le 2e^{-2n\epsilon^2} \qquad (4.4)$$

where $\overline{X}_n = n^{-1} \sum_{i=1}^n X_i$.

4.6 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. Let $n = 100$ and $\epsilon = .2$. We
saw that Chebyshev's inequality yielded

$$\mathbb{P}(|\overline{X}_n - p| > \epsilon) \le .0625.$$

According to Hoeffding's inequality,

$$\mathbb{P}(|\overline{X}_n - p| > .2) \le 2e^{-2(100)(.2)^2} = .00067$$

which is much smaller than .0625. ■
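The two bounds in Example 4.6 are easy to tabulate for other sample sizes. The sketch below evaluates the Chebyshev bound $1/(4n\epsilon^2)$ from Example 4.3 and the Hoeffding bound $2e^{-2n\epsilon^2}$ from Theorem 4.5; the function names and the extra sample sizes are assumptions added for illustration.

```python
import math

def chebyshev_bound(n, eps):
    """Upper bound 1 / (4 n eps^2) on P(|Xbar_n - p| > eps) from Example 4.3."""
    return 1.0 / (4 * n * eps ** 2)

def hoeffding_bound(n, eps):
    """Upper bound 2 exp(-2 n eps^2) on P(|Xbar_n - p| > eps) from Theorem 4.5."""
    return 2.0 * math.exp(-2 * n * eps ** 2)

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, chebyshev_bound(n, 0.2), hoeffding_bound(n, 0.2))
```

For $n = 100$ this reproduces .0625 and .00067. For small samples the ordering can reverse: at $n = 10$ and $\epsilon = .2$ the Chebyshev bound is 0.625 while the Hoeffding bound is about 0.90, so the advantage of Hoeffding's inequality appears as $n$ grows.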

Hoeffding's inequality gives us a simple way to create a confidence interval
for a binomial parameter $p$. We will discuss confidence intervals in detail
later (see Chapter 6) but here is the basic idea. Fix $\alpha > 0$ and let

$$\epsilon_n = \sqrt{\frac{1}{2n} \log\left( \frac{2}{\alpha} \right)}.$$

By Hoeffding's inequality,

$$\mathbb{P}(|\overline{X}_n - p| > \epsilon_n) \le 2e^{-2n\epsilon_n^2} = \alpha.$$

Let $C = (\overline{X}_n - \epsilon_n, \overline{X}_n + \epsilon_n)$. Then, $\mathbb{P}(p \notin C) = \mathbb{P}(|\overline{X}_n - p| > \epsilon_n) \le \alpha$. Hence,
$\mathbb{P}(p \in C) \ge 1 - \alpha$, that is, the random interval $C$ traps the true parameter
value $p$ with probability at least $1 - \alpha$; we call $C$ a $1 - \alpha$ confidence interval. More on
this later.
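The interval $C$ is simple to compute, and its coverage can be estimated by simulation. The sketch below is one possible way to do this under assumed settings ($p = 0.4$, $\alpha = 0.05$, and a few sample sizes); the function names and the number of replications are illustrative choices and not part of the text.

```python
import math
import random

def hoeffding_interval(xs, alpha):
    """Interval (Xbar - eps_n, Xbar + eps_n) with eps_n = sqrt(log(2/alpha) / (2n))."""
    n = len(xs)
    xbar = sum(xs) / n
    eps = math.sqrt(math.log(2 / alpha) / (2 * n))
    return xbar - eps, xbar + eps

def coverage(p, n, alpha, reps=2_000, seed=0):
    """Fraction of simulated Bernoulli(p) samples whose interval contains p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [1 if rng.random() < p else 0 for _ in range(n)]
        lo, hi = hoeffding_interval(xs, alpha)
        hits += lo < p < hi
    return hits / reps

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, coverage(p=0.4, n=n, alpha=0.05))
```

Because Hoeffding's inequality is only an upper bound, the simulated coverage will usually come out above the nominal $1 - \alpha$; the interval is conservative rather than exact.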

The following inequality is useful for bounding probability statements about 
Normal random variables. 



4.7 Theorem (Mill's Inequality). Let $Z \sim N(0, 1)$. Then,

$$\mathbb{P}(|Z| > t) \le \sqrt{\frac{2}{\pi}}\,\frac{e^{-t^2/2}}{t}.$$
4.2 Inequalities For Expectations

This section contains two inequalities on expected values.

4.8 Theorem (Cauchy-Schwarz inequality). If $X$ and $Y$ have finite
variances then

$$\mathbb{E}|XY| \le \sqrt{\mathbb{E}(X^2)\,\mathbb{E}(Y^2)}. \qquad (4.5)$$

Recall that a function $g$ is convex if for each $x, y$ and each $\alpha \in [0, 1]$,

$$g(\alpha x + (1 - \alpha) y) \le \alpha g(x) + (1 - \alpha) g(y).$$

If $g$ is twice differentiable and $g''(x) \ge 0$ for all $x$, then $g$ is convex. It can
be shown that if $g$ is convex, then $g$ lies above any line that touches $g$ at
some point, called a tangent line. A function $g$ is concave if $-g$ is convex.
Examples of convex functions are $g(x) = x^2$ and $g(x) = e^x$. Examples of
concave functions are $g(x) = -x^2$ and $g(x) = \log x$.

4.9 Theorem (Jensen's inequality). If $g$ is convex, then

$$\mathbb{E} g(X) \ge g(\mathbb{E}X).$$

If $g$ is concave, then

$$\mathbb{E} g(X) \le g(\mathbb{E}X).$$

Proof. Let $L(x) = a + bx$ be a line, tangent to $g(x)$ at the point $\mathbb{E}(X)$.
Since $g$ is convex, it lies above the line $L(x)$. So,

$$\mathbb{E} g(X) \ge \mathbb{E} L(X) = \mathbb{E}(a + bX) = a + b\,\mathbb{E}(X) = L(\mathbb{E}(X)) = g(\mathbb{E}X).$$ ■

From Jensen's inequality we see that $\mathbb{E}(X^2) \ge (\mathbb{E}X)^2$ and if $X$ is positive,
then $\mathbb{E}(1/X) \ge 1/\mathbb{E}(X)$. Since log is concave, $\mathbb{E}(\log X) \le \log \mathbb{E}(X)$.

4.3 Bibliographic Remarks 

Devroye et al. (1996) is a good reference on probability inequalities and their 
use in statistics and pattern recognition. The following proof of Hoeffding’s 
inequality is from that text. 
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Proof of Hoeffding’s Inequality. We will make use of the exact form of 
Taylor’s theorem: if ^ is a smooth function, then there is a number ^ G (0,n) 
such that g{u) = ^( 0 ) + ug'{0) + ^g 

Prooe of Theorem 4.4. For any t > 0, we have, from Markov’s inequality, 
that 






0=1 



= P p Yj > te 

V *=i y 

< e“‘"E 



= e-*']qE(e*^-). 



(4.8) 



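Since $a_i \le Y_i \le b_i$, we can write $Y_i$ as a convex combination of $a_i$ and $b_i$,
namely, $Y_i = \alpha b_i + (1 - \alpha) a_i$ where $\alpha = (Y_i - a_i)/(b_i - a_i)$. So, by the
convexity of $e^{ty}$ we have

$$e^{tY_i} \le \frac{Y_i - a_i}{b_i - a_i}\, e^{t b_i} + \frac{b_i - Y_i}{b_i - a_i}\, e^{t a_i}.$$

Take expectations of both sides and use the fact that $\mathbb{E}(Y_i) = 0$ to get

$$\mathbb{E}\, e^{tY_i} \le -\frac{a_i}{b_i - a_i}\, e^{t b_i} + \frac{b_i}{b_i - a_i}\, e^{t a_i} = e^{g(u)} \qquad (4.9)$$

where $u = t(b_i - a_i)$, $g(u) = -\gamma u + \log(1 - \gamma + \gamma e^{u})$ and $\gamma = -a_i/(b_i - a_i)$.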

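Note that $g(0) = g'(0) = 0$. Also, $g''(u) \le 1/4$ for all $u > 0$. By Taylor’s
theorem, there is a $\xi \in (0, u)$ such that

$$g(u) = g(0) + u g'(0) + \frac{u^2}{2} g''(\xi) = \frac{u^2}{2} g''(\xi) \le \frac{u^2}{8} = \frac{t^2 (b_i - a_i)^2}{8}.$$

Hence,

$$\mathbb{E}\, e^{tY_i} \le e^{g(u)} \le e^{t^2 (b_i - a_i)^2 / 8}.$$

The result follows from (4.8). ■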
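The key step of the proof is the bound $\mathbb{E}\, e^{tY_i} \le e^{t^2 (b_i - a_i)^2/8}$ for a mean-zero variable with $a_i \le Y_i \le b_i$. A quick numerical check for one family of such variables is sketched below. The choice $Y = X - p$ with $X \sim \mathrm{Bernoulli}(p)$ (so $a = -p$ and $b = 1 - p$), the grids of $p$ and $t$, and the function names are assumptions made for illustration; a check on a grid is of course not a proof.

```python
import math


def mgf_centered_bernoulli(p, t):
    """E exp(tY) for Y = X - p with X ~ Bernoulli(p), so that -p <= Y <= 1 - p."""
    return p * math.exp(t * (1 - p)) + (1 - p) * math.exp(-t * p)


def hoeffding_mgf_bound(t, a, b):
    """The bound exp(t^2 (b - a)^2 / 8) obtained at the end of the proof."""
    return math.exp(t * t * (b - a) ** 2 / 8)


# Largest ratio (exact MGF) / (bound) over a small illustrative grid.
worst_ratio = 0.0
for p in [0.05, 0.2, 0.5, 0.8, 0.95]:
    for t in [0.1, 0.5, 1.0, 2.0, 5.0]:
        ratio = mgf_centered_bernoulli(p, t) / hoeffding_mgf_bound(t, -p, 1 - p)
        worst_ratio = max(worst_ratio, ratio)

print(worst_ratio)  # stays at or below 1 on this grid
```

The ratio is closest to 1 for small $t$ and $p = 1/2$, where the two sides agree to second order in $t$.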

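Proof of Theorem 4.5. Let $Y_i = (1/n)(X_i - p)$. Then $\mathbb{E}(Y_i) = 0$ and
$a \le Y_i \le b$ where $a = -p/n$ and $b = (1 - p)/n$. Also, $(b - a)^2 = 1/n^2$.
Applying Theorem 4.4 we get

$$\mathbb{P}(\overline{X}_n - p > \epsilon) = \mathbb{P}\left(\sum_i Y_i > \epsilon\right) \le e^{-t\epsilon} e^{t^2/(8n)}.$$

The above holds for any $t > 0$. In particular, take $t = 4n\epsilon$ and we get
$\mathbb{P}(\overline{X}_n - p > \epsilon) \le e^{-2n\epsilon^2}$. By a similar argument we can show that
$\mathbb{P}(\overline{X}_n - p < -\epsilon) \le e^{-2n\epsilon^2}$. Putting these together we get
$\mathbb{P}(|\overline{X}_n - p| > \epsilon) \le 2e^{-2n\epsilon^2}$. ■

4.5 Exercises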

1. Let $X \sim \mathrm{Exponential}(\beta)$. Find $\mathbb{P}(|X - \mu_X| \ge k\sigma_X)$ for $k > 1$. Compare
this to the bound you get from Chebyshev’s inequality.

2. Let $X \sim \mathrm{Poisson}(\lambda)$. Use Chebyshev’s inequality to show that
$\mathbb{P}(X \ge 2\lambda) \le 1/\lambda$.

3. Let $X_1, \ldots, X_n \sim \mathrm{Bernoulli}(p)$ and $\overline{X}_n = n^{-1} \sum_{i=1}^n X_i$. Bound
$\mathbb{P}(|\overline{X}_n - p| > \epsilon)$ using Chebyshev’s inequality and using Hoeffding’s inequality.
Show that, when $n$ is large, the bound from Hoeffding’s inequality is
smaller than the bound from Chebyshev’s inequality.

4. Let $X_1, \ldots, X_n \sim \mathrm{Bernoulli}(p)$.

(a) Let $\alpha > 0$ be fixed and define

$$\epsilon_n = \sqrt{\frac{1}{2n} \log\left(\frac{2}{\alpha}\right)}.$$

Let $\hat{p}_n = n^{-1} \sum_{i=1}^n X_i$. Define $C_n = (\hat{p}_n - \epsilon_n,\ \hat{p}_n + \epsilon_n)$. Use Hoeffding’s
inequality to show that

$$\mathbb{P}(C_n \text{ contains } p) \ge 1 - \alpha.$$

In practice, we truncate the interval so it does not go below 0 or above 1.

(b) (Computer Experiment.) Let’s examine the properties of this confidence
interval. Let $\alpha = 0.05$ and $p = 0.4$. Conduct a simulation study
to see how often the interval contains $p$ (called the coverage). Do this
for various values of $n$ between 1 and 10000. Plot the coverage versus $n$.

(c) Plot the length of the interval versus $n$. Suppose we want the length
of the interval to be no more than .05. How large should $n$ be?

5. Prove Mill’s inequality, Theorem 4.7. Hint. Note that $\mathbb{P}(|Z| > t) =
2\mathbb{P}(Z > t)$. Now write out what $\mathbb{P}(Z > t)$ means and note that $x/t > 1$
whenever $x > t$.

6. Let $Z \sim N(0, 1)$. Find $\mathbb{P}(|Z| > t)$ and plot this as a function of $t$. From
Markov’s inequality, we have the bound $\mathbb{P}(|Z| > t) \le \mathbb{E}|Z|^k / t^k$ for any
$k > 0$. Plot these bounds for $k = 1, 2, 3, 4, 5$ and compare them to the
true value of $\mathbb{P}(|Z| > t)$. Also, plot the bound from Mill’s inequality.

7. Let $X_1, \ldots, X_n \sim N(0, 1)$. Bound $\mathbb{P}(|\overline{X}_n| > t)$ using Mill’s inequality,
where $\overline{X}_n = n^{-1} \sum_{i=1}^n X_i$. Compare to the Chebyshev bound.




5 

Convergence of Random Variables 



5.1 Introduction 

The most important aspect of probability theory concerns the behavior of
sequences of random variables. This part of probability is called large sample
theory, or limit theory, or asymptotic theory. The basic question is this:
what can we say about the limiting behavior of a sequence of random variables
$X_1, X_2, X_3, \ldots$? Since statistics and data mining are all about gathering data,
we will naturally be interested in what happens as we gather more and more
data.

In calculus we say that a sequence of real numbers $x_n$ converges to a limit
$x$ if, for every $\epsilon > 0$, $|x_n - x| < \epsilon$ for all large $n$. In probability, convergence is
more subtle. Going back to calculus for a moment, suppose that $x_n = x$ for
all $n$. Then, trivially, $\lim_{n \to \infty} x_n = x$. Consider a probabilistic version of this
example. Suppose that $X_1, X_2, \ldots$ is a sequence of random variables which
are independent and suppose each has a $N(0, 1)$ distribution. Since these all
have the same distribution, we are tempted to say that $X_n$ “converges” to
$X \sim N(0, 1)$. But this can’t quite be right since $\mathbb{P}(X_n = X) = 0$ for all $n$.
(Two continuous random variables are equal with probability zero.)

Here is another example. Consider $X_1, X_2, \ldots$ where $X_n \sim N(0, 1/n)$. Intuitively,
$X_n$ is very concentrated around 0 for large $n$ so we would like to say
that $X_n$ converges to 0. But $\mathbb{P}(X_n = 0) = 0$ for all $n$. Clearly, we need to
develop some tools for discussing convergence in a rigorous way. This chapter
develops the appropriate methods.

There are two main ideas in this chapter which we state informally here:

1. The law of large numbers says that the sample average $\overline{X}_n = n^{-1} \sum_{i=1}^n X_i$
converges in probability to the expectation $\mu = \mathbb{E}(X_i)$. This means
that $\overline{X}_n$ is close to $\mu$ with high probability.

2. The central limit theorem says that $\sqrt{n}(\overline{X}_n - \mu)$ converges in distribution
to a Normal distribution. This means that the sample average
has approximately a Normal distribution for large $n$.



5.2 Types of Convergence 

The two main types of convergence are defined as follows.

5.1 Definition. Let $X_1, X_2, \ldots$ be a sequence of random variables and let
$X$ be another random variable. Let $F_n$ denote the CDF of $X_n$ and let $F$
denote the CDF of $X$.

1. $X_n$ converges to $X$ in probability, written $X_n \xrightarrow{P} X$, if for every
$\epsilon > 0$,

$$\mathbb{P}(|X_n - X| > \epsilon) \to 0 \qquad (5.1)$$

as $n \to \infty$.

2. $X_n$ converges to $X$ in distribution, written $X_n \rightsquigarrow X$, if

$$\lim_{n \to \infty} F_n(t) = F(t) \qquad (5.2)$$

at all $t$ for which $F$ is continuous.

When the limiting random variable is a point mass, we change the notation
slightly. If $\mathbb{P}(X = c) = 1$ and $X_n \xrightarrow{P} X$ then we write $X_n \xrightarrow{P} c$. Similarly, if
$X_n \rightsquigarrow X$ we write $X_n \rightsquigarrow c$.

There is another type of convergence which we introduce mainly because it 
is useful for proving convergence in probability. 



FIGURE 5.1. Example 5.3. $X_n$ converges in distribution to $X$ because $F_n(t)$ converges
to $F(t)$ at all points except $t = 0$. Convergence is not required at $t = 0$
because $t = 0$ is not a point of continuity for $F$.



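5.2 Definition. $X_n$ converges to $X$ in quadratic mean (also called
convergence in $L_2$), written $X_n \xrightarrow{qm} X$, if

$$\mathbb{E}(X_n - X)^2 \to 0 \qquad (5.3)$$

as $n \to \infty$.

Again, if $X$ is a point mass at $c$ we write $X_n \xrightarrow{qm} c$ instead of $X_n \xrightarrow{qm} X$.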

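5.3 Example. Let $X_n \sim N(0, 1/n)$. Intuitively, $X_n$ is concentrating at 0 so
we would like to say that $X_n$ converges to 0. Let’s see if this is true. Let $F$ be
the distribution function for a point mass at 0. Note that $\sqrt{n} X_n \sim N(0, 1)$.
Let $Z$ denote a standard normal random variable. For $t < 0$, $F_n(t) = \mathbb{P}(X_n < t)
= \mathbb{P}(\sqrt{n} X_n < \sqrt{n} t) = \mathbb{P}(Z < \sqrt{n} t) \to 0$ since $\sqrt{n} t \to -\infty$. For $t > 0$,
$F_n(t) = \mathbb{P}(X_n < t) = \mathbb{P}(\sqrt{n} X_n < \sqrt{n} t) = \mathbb{P}(Z < \sqrt{n} t) \to 1$ since $\sqrt{n} t \to \infty$.
Hence, $F_n(t) \to F(t)$ for all $t \ne 0$ and so $X_n \rightsquigarrow 0$. Notice that $F_n(0) = 1/2 \ne
F(0) = 1$ so convergence fails at $t = 0$. That doesn’t matter because $t = 0$
is not a continuity point of $F$ and the definition of convergence in distribution
only requires convergence at continuity points. See Figure 5.1. Now consider
convergence in probability. For any $\epsilon > 0$, using Markov’s inequality,

$$\mathbb{P}(|X_n| > \epsilon) = \mathbb{P}(|X_n|^2 > \epsilon^2) \le \frac{\mathbb{E}(X_n^2)}{\epsilon^2} = \frac{1/n}{\epsilon^2} \to 0$$

as $n \to \infty$. Hence, $X_n \xrightarrow{P} 0$. ■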
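The behavior of $F_n$ in Example 5.3 can be tabulated directly, since $F_n(t) = \Phi(\sqrt{n}\, t)$. The sketch below is illustrative only: the helper names `normal_cdf` and `F_n` and the evaluation points $t = \pm 0.1$ are assumptions chosen for the example. It uses the identity $\Phi(z) = \tfrac{1}{2}(1 + \mathrm{erf}(z/\sqrt{2}))$.

```python
import math


def normal_cdf(z):
    """Standard Normal CDF written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def F_n(t, n):
    """CDF of X_n ~ N(0, 1/n): P(X_n <= t) = Phi(sqrt(n) * t)."""
    return normal_cdf(math.sqrt(n) * t)


# Left of 0 the CDF tends to 0, right of 0 it tends to 1, and at 0 it stays 1/2.
for n in [1, 10, 100, 10_000]:
    print(n, F_n(-0.1, n), F_n(0.0, n), F_n(0.1, n))
```

The middle column never moves from 1/2, which is exactly the failure of convergence at the discontinuity point $t = 0$ that the definition permits.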



The next theorem gives the relationship between the types of convergence. 
The results are summarized in Figure 5.2. 

5.4 Theorem. The following relationships hold:

(a) $X_n \xrightarrow{qm} X$ implies that $X_n \xrightarrow{P} X$.

(b) $X_n \xrightarrow{P} X$ implies that $X_n \rightsquigarrow X$.

(c) If $X_n \rightsquigarrow X$ and if $\mathbb{P}(X = c) = 1$ for some real number $c$, then $X_n \xrightarrow{P} X$.

In general, none of the reverse implications hold except the special case in
(c).

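Proof. We start by proving (a). Suppose that $X_n \xrightarrow{qm} X$. Fix $\epsilon > 0$. Then,
using Markov’s inequality,

$$\mathbb{P}(|X_n - X| > \epsilon) = \mathbb{P}(|X_n - X|^2 > \epsilon^2) \le \frac{\mathbb{E}|X_n - X|^2}{\epsilon^2} \to 0.$$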

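Proof of (b). This proof is a little more complicated. You may skip it if you
wish. Fix $\epsilon > 0$ and let $x$ be a continuity point of $F$. Then

$$F_n(x) = \mathbb{P}(X_n \le x) = \mathbb{P}(X_n \le x, X \le x + \epsilon) + \mathbb{P}(X_n \le x, X > x + \epsilon)$$
$$\le \mathbb{P}(X \le x + \epsilon) + \mathbb{P}(|X_n - X| > \epsilon)$$
$$= F(x + \epsilon) + \mathbb{P}(|X_n - X| > \epsilon).$$

Also,

$$F(x - \epsilon) = \mathbb{P}(X \le x - \epsilon) = \mathbb{P}(X \le x - \epsilon, X_n \le x) + \mathbb{P}(X \le x - \epsilon, X_n > x)$$
$$\le F_n(x) + \mathbb{P}(|X_n - X| > \epsilon).$$

Hence,

$$F(x - \epsilon) - \mathbb{P}(|X_n - X| > \epsilon) \le F_n(x) \le F(x + \epsilon) + \mathbb{P}(|X_n - X| > \epsilon).$$

Take the limit as $n \to \infty$ to conclude that

$$F(x - \epsilon) \le \liminf_{n \to \infty} F_n(x) \le \limsup_{n \to \infty} F_n(x) \le F(x + \epsilon).$$

This holds for all $\epsilon > 0$. Take the limit as $\epsilon \to 0$ and use the fact that $F$ is
continuous at $x$ and conclude that $\lim_n F_n(x) = F(x)$.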

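Proof of (c). Fix $\epsilon > 0$. Then,

$$\mathbb{P}(|X_n - c| > \epsilon) = \mathbb{P}(X_n < c - \epsilon) + \mathbb{P}(X_n > c + \epsilon)$$
$$\le \mathbb{P}(X_n \le c - \epsilon) + \mathbb{P}(X_n > c + \epsilon)$$
$$= F_n(c - \epsilon) + 1 - F_n(c + \epsilon)$$
$$\to F(c - \epsilon) + 1 - F(c + \epsilon)$$
$$= 0 + 1 - 1 = 0.$$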

Let us now show that the reverse implications do not hold. 

Convergence in probability does not imply convergence in quadratic
mean. Let $U \sim \mathrm{Unif}(0, 1)$ and let $X_n = \sqrt{n}\, I_{(0,1/n)}(U)$. Then $\mathbb{P}(|X_n| > \epsilon) =
\mathbb{P}(\sqrt{n}\, I_{(0,1/n)}(U) > \epsilon) = \mathbb{P}(0 \le U < 1/n) = 1/n \to 0$. Hence, $X_n \xrightarrow{P} 0$. But
$\mathbb{E}(X_n^2) = n \int_0^{1/n} du = 1$ for all $n$ so $X_n$ does not converge in quadratic mean.

FIGURE 5.2. Relationship between types of convergence: quadratic mean implies
probability, which implies distribution; distribution implies probability when
the limit is a point-mass distribution.

Convergence in distribution does not imply convergence in probability.
Let $X \sim N(0, 1)$. Let $X_n = -X$ for $n = 1, 2, 3, \ldots$; hence $X_n \sim N(0, 1)$.
$X_n$ has the same distribution function as $X$ for all $n$ so, trivially,
$\lim_n F_n(x) = F(x)$ for all $x$. Therefore, $X_n \rightsquigarrow X$. But $\mathbb{P}(|X_n - X| > \epsilon) =
\mathbb{P}(|2X| > \epsilon) = \mathbb{P}(|X| > \epsilon/2) \ne 0$. So $X_n$ does not converge to $X$ in probability. ■

Warning! One might conjecture that if $X_n \xrightarrow{P} b$, then $\mathbb{E}(X_n) \to b$. This is
not true.^1 Let $X_n$ be a random variable defined by $\mathbb{P}(X_n = n^2) = 1/n$ and
$\mathbb{P}(X_n = 0) = 1 - (1/n)$. Now, $\mathbb{P}(|X_n| < \epsilon) = \mathbb{P}(X_n = 0) = 1 - (1/n) \to 1$.
Hence, $X_n \xrightarrow{P} 0$. However, $\mathbb{E}(X_n) = [n^2 \times (1/n)] + [0 \times (1 - (1/n))] = n$. Thus,
$\mathbb{E}(X_n) \to \infty$.
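The counterexample in the warning can be made concrete with exact arithmetic: the probability of being away from 0 shrinks like $1/n$ while the mean grows like $n$. The following is a small illustrative sketch; the function names and the choice $\epsilon = 0.5$ are assumptions for the example.

```python
def prob_far_from_zero(n, eps):
    """P(|X_n| > eps) when P(X_n = n^2) = 1/n and P(X_n = 0) = 1 - 1/n."""
    return 1.0 / n if n * n > eps else 0.0


def expectation(n):
    """E(X_n) = n^2 * (1/n) + 0 * (1 - 1/n)."""
    return n * n * (1.0 / n)


# The first column of numbers shrinks to 0 while the second grows without bound.
for n in [10, 100, 1000]:
    print(n, prob_far_from_zero(n, 0.5), expectation(n))
```

The rare value $n^2$ is large enough to dominate the mean even though it is almost never observed, which is what uniform integrability rules out.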

Summary. Stare at Figure 5.2. 

Some convergence properties are preserved under transformations. 

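5.5 Theorem. Let $X_n, X, Y_n, Y$ be random variables. Let $g$ be a continuous
function.

(a) If $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y$, then $X_n + Y_n \xrightarrow{P} X + Y$.

(b) If $X_n \xrightarrow{qm} X$ and $Y_n \xrightarrow{qm} Y$, then $X_n + Y_n \xrightarrow{qm} X + Y$.

(c) If $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow c$, then $X_n + Y_n \rightsquigarrow X + c$.

(d) If $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y$, then $X_n Y_n \xrightarrow{P} XY$.

(e) If $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow c$, then $X_n Y_n \rightsquigarrow cX$.

(f) If $X_n \xrightarrow{P} X$, then $g(X_n) \xrightarrow{P} g(X)$.

(g) If $X_n \rightsquigarrow X$, then $g(X_n) \rightsquigarrow g(X)$.

Parts (c) and (e) are known as Slutzky’s theorem. It is worth noting that
$X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow Y$ does not in general imply that $X_n + Y_n \rightsquigarrow X + Y$.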



^1 We can conclude that $\mathbb{E}(X_n) \to b$ if $X_n$ is uniformly integrable. See the appendix.

5.3 The Law of Large Numbers

Now we come to a crowning achievement in probability, the law of large num- 
bers. This theorem says that the mean of a large sample is close to the mean 
of the distribution. For example, the proportion of heads of a large number 
of tosses is expected to be close to 1/2. We now make this more precise. 

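Let $X_1, X_2, \ldots$ be an IID sample, let $\mu = \mathbb{E}(X_1)$ and $\sigma^2 = \mathbb{V}(X_1)$.^2 Recall
that the sample mean is defined as $\overline{X}_n = n^{-1} \sum_{i=1}^n X_i$ and that $\mathbb{E}(\overline{X}_n) = \mu$
and $\mathbb{V}(\overline{X}_n) = \sigma^2/n$.

5.6 Theorem (The Weak Law of Large Numbers (WLLN)).^3
If $X_1, \ldots, X_n$ are IID, then $\overline{X}_n \xrightarrow{P} \mu$.

Interpretation of the WLLN: The distribution of $\overline{X}_n$ becomes more
concentrated around $\mu$ as $n$ gets large.

Proof. Assume that $\sigma < \infty$. This is not necessary but it simplifies the
proof. Using Chebyshev’s inequality,

$$\mathbb{P}(|\overline{X}_n - \mu| > \epsilon) \le \frac{\mathbb{V}(\overline{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2}$$

which tends to 0 as $n \to \infty$. ■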




5.7 Example. Consider flipping a coin for which the probability of heads is
$p$. Let $X_i$ denote the outcome of a single toss (0 or 1). Hence, $p = \mathbb{P}(X_i =
1) = \mathbb{E}(X_i)$. The fraction of heads after $n$ tosses is $\overline{X}_n$. According to the law
of large numbers, $\overline{X}_n$ converges to $p$ in probability. This does not mean that
$\overline{X}_n$ will numerically equal $p$. It means that, when $n$ is large, the distribution
of $\overline{X}_n$ is tightly concentrated around $p$. Suppose that $p = 1/2$. How large
should $n$ be so that $\mathbb{P}(.4 \le \overline{X}_n \le .6) \ge .7$? First, $\mathbb{E}(\overline{X}_n) = p = 1/2$ and
$\mathbb{V}(\overline{X}_n) = \sigma^2/n = p(1 - p)/n = 1/(4n)$. From Chebyshev’s inequality,

$$\mathbb{P}(.4 \le \overline{X}_n \le .6) = \mathbb{P}(|\overline{X}_n - \mu| \le .1)$$
$$= 1 - \mathbb{P}(|\overline{X}_n - \mu| > .1)$$
$$\ge 1 - \frac{1}{4n(.1)^2} = 1 - \frac{25}{n}.$$

The last expression will be larger than .7 if $n = 84$. ■
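The sample-size calculation in Example 5.7 can be reproduced by searching for the smallest $n$ at which the Chebyshev lower bound exceeds the target. The sketch below is illustrative; the function names are assumptions. It also computes the exact Binomial probability at that $n$, which is not part of the example but shows how conservative the Chebyshev bound is.

```python
import math


def chebyshev_lower_bound(n, p=0.5, half_width=0.1):
    """Lower bound 1 - p(1-p) / (n * half_width^2) on P(|Xbar_n - p| <= half_width)."""
    return 1.0 - p * (1.0 - p) / (n * half_width ** 2)


def smallest_n(target=0.7, p=0.5, half_width=0.1):
    """Smallest n for which the Chebyshev lower bound is strictly above target."""
    n = 1
    while chebyshev_lower_bound(n, p, half_width) <= target:
        n += 1
    return n


def exact_probability(n):
    """Exact P(0.4 <= Xbar_n <= 0.6) for a fair coin, from the Binomial(n, 1/2) pmf."""
    # The condition 0.4 <= k/n <= 0.6 is tested in integers to avoid rounding issues.
    favorable = sum(math.comb(n, k) for k in range(n + 1) if 4 * n <= 10 * k <= 6 * n)
    return favorable / 2 ** n


n_star = smallest_n()
print(n_star)  # 84
print(exact_probability(n_star))
```

The exact probability at $n = 84$ is well above .7, so a much smaller sample would in fact suffice; Chebyshev’s inequality trades sharpness for generality.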



^2 Note that $\mu = \mathbb{E}(X_i)$ is the same for all $i$ so we can define $\mu = \mathbb{E}(X_i)$ for any $i$. By
convention, we often write $\mu = \mathbb{E}(X_1)$.

^3 There is a stronger theorem in the appendix called the strong law of large numbers.

5.4 The Central Limit Theorem

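The law of large numbers says that the distribution of $\overline{X}_n$ piles up near $\mu$.
This isn’t enough to help us approximate probability statements about $\overline{X}_n$.
For this we need the central limit theorem.

Suppose that $X_1, \ldots, X_n$ are IID with mean $\mu$ and variance $\sigma^2$. The central
limit theorem (CLT) says that $\overline{X}_n = n^{-1} \sum_i X_i$ has a distribution which is
approximately Normal with mean $\mu$ and variance $\sigma^2/n$. This is remarkable
since nothing is assumed about the distribution of $X_i$ except the existence of
the mean and variance.

5.8 Theorem (The Central Limit Theorem (CLT)). Let $X_1, \ldots, X_n$ be IID
with mean $\mu$ and variance $\sigma^2$. Let $\overline{X}_n = n^{-1} \sum_{i=1}^n X_i$. Then

$$Z_n \equiv \frac{\overline{X}_n - \mu}{\sqrt{\mathbb{V}(\overline{X}_n)}} = \frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma} \rightsquigarrow Z$$

where $Z \sim N(0, 1)$. In other words,

$$\lim_{n \to \infty} \mathbb{P}(Z_n \le z) = \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx.$$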



Interpretation: Probability statements about $\overline{X}_n$ can be approximated
using a Normal distribution. It’s the probability statements that we
are approximating, not the random variable itself.

In addition to $Z_n \rightsquigarrow N(0, 1)$, there are several forms of notation to denote
the fact that the distribution of $Z_n$ is converging to a Normal. They all mean
the same thing. Here they are:

the number of errors in the programs. We want to approximate $\mathbb{P}(\overline{X}_n < 5.5)$.
Let $\mu = \mathbb{E}(X_1) = \lambda = 5$ and $\sigma^2 = \mathbb{V}(X_1) = \lambda = 5$. Then,

$$\mathbb{P}(\overline{X}_n < 5.5) = \mathbb{P}\left(\frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma} < \frac{\sqrt{n}(5.5 - \mu)}{\sigma}\right) \approx \mathbb{P}(Z < 2.5) = .9938.$$
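The Normal approximation above can be compared with an exact calculation, because a sum of IID Poisson variables is again Poisson. The sketch below is illustrative and rests on one loudly stated assumption: the sample size is taken to be $n = 125$, which is inferred from the value $2.5 = \sqrt{n}(5.5 - 5)/\sqrt{5}$ since the sentence stating the sample size does not appear in the passage. The helper names are also assumptions.

```python
import math


def normal_cdf(z):
    """Standard Normal CDF written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def poisson_cdf(k, lam):
    """P(S <= k) for S ~ Poisson(lam), summing the pmf computed on the log scale."""
    return sum(
        math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1)) for j in range(k + 1)
    )


lam = 5.0
n = 125  # assumption: inferred from sqrt(n) * (5.5 - 5) / sqrt(5) = 2.5

# CLT approximation: P(Xbar_n < 5.5) is approximately Phi(z).
z = math.sqrt(n) * (5.5 - lam) / math.sqrt(lam)
clt_value = normal_cdf(z)

# Exact value: the sum of n Poisson(lam) variables is Poisson(n * lam), and
# Xbar_n < 5.5 is the event {sum <= 687} because 125 * 5.5 = 687.5.
exact_value = poisson_cdf(687, n * lam)

print(round(z, 4), round(clt_value, 4), exact_value)
```

The two probabilities are close, which is the content of the interpretation above: it is the probability statement that is being approximated.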



The central limit theorem tells us that $Z_n = \sqrt{n}(\overline{X}_n - \mu)/\sigma$ is approximately
$N(0, 1)$. However, we rarely know $\sigma$. Later, we will see that we can estimate
$\sigma^2$ from $X_1, \ldots, X_n$ by

$$S_n^2 = \frac{1}{n - 1} \sum_{i=1}^n (X_i - \overline{X}_n)^2.$$

This raises the following question: if we replace $\sigma$ with $S_n$, is the central limit
theorem still true? The answer is yes.

5.10 Theorem. Assume the same conditions as the CLT. Then,

$$\frac{\sqrt{n}(\overline{X}_n - \mu)}{S_n} \rightsquigarrow N(0, 1).$$

You might wonder how accurate the normal approximation is. The answer
is given in the Berry-Esseen theorem.

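5.11 Theorem (The Berry-Esseen Inequality). Suppose that $\mathbb{E}|X_1|^3 < \infty$. Then

$$\sup_z |\mathbb{P}(Z_n \le z) - \Phi(z)| \le \frac{33}{4} \frac{\mathbb{E}|X_1 - \mu|^3}{\sqrt{n}\, \sigma^3}. \qquad (5.4)$$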

There is also a multivariate version of the central limit theorem. 



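5.12 Theorem (Multivariate central limit theorem). Let $X_1, \ldots, X_n$ be IID random
vectors where

$$X_i = \begin{pmatrix} X_{1i} \\ X_{2i} \\ \vdots \\ X_{ki} \end{pmatrix}$$

with mean

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_k \end{pmatrix} = \begin{pmatrix} \mathbb{E}(X_{1i}) \\ \mathbb{E}(X_{2i}) \\ \vdots \\ \mathbb{E}(X_{ki}) \end{pmatrix}$$

and variance matrix $\Sigma$. Let

$$\overline{X} = \begin{pmatrix} \overline{X}_1 \\ \overline{X}_2 \\ \vdots \\ \overline{X}_k \end{pmatrix}$$

where $\overline{X}_j = n^{-1} \sum_{i=1}^n X_{ji}$. Then,

$$\sqrt{n}(\overline{X} - \mu) \rightsquigarrow N(0, \Sigma).$$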



5.5 The Delta Method 

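If $Y_n$ has a limiting Normal distribution then the delta method allows us to
find the limiting distribution of $g(Y_n)$ where $g$ is any smooth function.

5.13 Theorem (The Delta Method). Suppose that

$$\frac{\sqrt{n}(Y_n - \mu)}{\sigma} \rightsquigarrow N(0, 1)$$

and that $g$ is a differentiable function such that $g'(\mu) \ne 0$. Then

$$\frac{\sqrt{n}(g(Y_n) - g(\mu))}{|g'(\mu)|\, \sigma} \rightsquigarrow N(0, 1).$$

In other words,

$$Y_n \approx N\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{implies that} \quad g(Y_n) \approx N\left(g(\mu), (g'(\mu))^2 \frac{\sigma^2}{n}\right).$$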



5.14 Example. Let $X_1, \ldots, X_n$ be IID with finite mean $\mu$ and finite variance
$\sigma^2$. By the central limit theorem, $\sqrt{n}(\overline{X}_n - \mu)/\sigma \rightsquigarrow N(0, 1)$. Let $W_n = e^{\overline{X}_n}$.
Thus, $W_n = g(\overline{X}_n)$ where $g(s) = e^s$. Since $g'(s) = e^s$, the delta method
implies that $W_n \approx N(e^{\mu}, e^{2\mu}\sigma^2/n)$. ■
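A small simulation can show how well the delta-method approximation in Example 5.14 works in a particular case. The sketch below is illustrative only. The example leaves the distribution of the $X_i$ unspecified, so Uniform(0, 1) data (mean $1/2$, variance $1/12$), the sample size, the number of replications, and the seed are all assumptions made here.

```python
import math
import random
import statistics

random.seed(2)
n, reps = 200, 2000          # hypothetical sample size and number of replications
mu, sigma2 = 0.5, 1.0 / 12   # mean and variance of Uniform(0, 1)


def w_n():
    """One draw of W_n = exp(sample mean of n Uniform(0, 1) variables)."""
    return math.exp(sum(random.random() for _ in range(n)) / n)


draws = [w_n() for _ in range(reps)]

# Delta-method approximation: W_n is roughly N(e^mu, e^(2 mu) * sigma^2 / n).
delta_mean = math.exp(mu)
delta_sd = math.exp(mu) * math.sqrt(sigma2 / n)

print(statistics.mean(draws), delta_mean)
print(statistics.stdev(draws), delta_sd)
```

The simulated mean and standard deviation of $W_n$ should land close to the delta-method values; the agreement improves as $n$ grows because the linearization of $g$ around $\mu$ becomes more accurate.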

There is also a multivariate version of the delta method. 

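5.15 Theorem (The Multivariate Delta Method). Suppose that $Y_n = (Y_{n1}, \ldots, Y_{nk})$
is a sequence of random vectors such that

$$\sqrt{n}(Y_n - \mu) \rightsquigarrow N(0, \Sigma).$$

Let $g : \mathbb{R}^k \to \mathbb{R}$ and let

$$\nabla g(y) = \begin{pmatrix} \dfrac{\partial g}{\partial y_1} \\ \vdots \\ \dfrac{\partial g}{\partial y_k} \end{pmatrix}.$$

Let $\nabla_\mu$ denote $\nabla g(y)$ evaluated at $y = \mu$ and assume that the elements of
$\nabla_\mu$ are nonzero. Then

$$\sqrt{n}(g(Y_n) - g(\mu)) \rightsquigarrow N\left(0, \nabla_\mu^T \Sigma \nabla_\mu\right).$$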



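5.16 Example. Let

$$\begin{pmatrix} X_{11} \\ X_{21} \end{pmatrix}, \begin{pmatrix} X_{12} \\ X_{22} \end{pmatrix}, \ldots, \begin{pmatrix} X_{1n} \\ X_{2n} \end{pmatrix}$$

be IID random vectors with mean $\mu = (\mu_1, \mu_2)^T$ and variance $\Sigma$. Let

$$\overline{X}_1 = \frac{1}{n} \sum_{i=1}^n X_{1i}, \qquad \overline{X}_2 = \frac{1}{n} \sum_{i=1}^n X_{2i}$$

and define $Y_n = \overline{X}_1 \overline{X}_2$. Thus, $Y_n = g(\overline{X}_1, \overline{X}_2)$ where $g(s_1, s_2) = s_1 s_2$. By the
central limit theorem,

$$\sqrt{n} \begin{pmatrix} \overline{X}_1 - \mu_1 \\ \overline{X}_2 - \mu_2 \end{pmatrix} \rightsquigarrow N(0, \Sigma).$$

Now

$$\nabla g(s) = \begin{pmatrix} \dfrac{\partial g}{\partial s_1} \\ \dfrac{\partial g}{\partial s_2} \end{pmatrix} = \begin{pmatrix} s_2 \\ s_1 \end{pmatrix}$$

and so

$$\nabla_\mu^T \Sigma \nabla_\mu = (\mu_2 \ \ \mu_1) \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix} \begin{pmatrix} \mu_2 \\ \mu_1 \end{pmatrix} = \mu_2^2 \sigma_{11} + 2\mu_1\mu_2\sigma_{12} + \mu_1^2 \sigma_{22}.$$

Therefore,

$$\sqrt{n}(\overline{X}_1 \overline{X}_2 - \mu_1\mu_2) \rightsquigarrow N\left(0, \mu_2^2 \sigma_{11} + 2\mu_1\mu_2\sigma_{12} + \mu_1^2 \sigma_{22}\right). \quad ■$$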
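The quadratic form $\nabla_\mu^T \Sigma \nabla_\mu$ in Example 5.16 is easy to evaluate numerically and to compare with the closed form derived there. In the sketch below the values of $\mu_1$, $\mu_2$ and $\Sigma$ are hypothetical numbers chosen only for illustration, and `quadratic_form` is a helper name introduced here.

```python
def quadratic_form(grad, sigma):
    """grad^T Sigma grad for a k-vector and a k x k matrix given as nested lists."""
    k = len(grad)
    return sum(grad[i] * sigma[i][j] * grad[j] for i in range(k) for j in range(k))


mu1, mu2 = 2.0, 3.0                  # hypothetical means
sigma = [[1.0, 0.5], [0.5, 4.0]]     # hypothetical variance matrix
grad_at_mu = [mu2, mu1]              # gradient of g(s1, s2) = s1 * s2 evaluated at mu

general = quadratic_form(grad_at_mu, sigma)
closed_form = mu2**2 * sigma[0][0] + 2 * mu1 * mu2 * sigma[0][1] + mu1**2 * sigma[1][1]

print(general, closed_form)  # 31.0 31.0
```

The same helper works for any $k$, so only the gradient changes when a different function $g$ is used.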



5.6 Bibliographic Remarks 

Convergence plays a central role in modern probability theory. For more details,
see Grimmett and Stirzaker (1982), Karr (1993), and Billingsley (1979).
Advanced convergence theory is explained in great detail in van der Vaart
and Wellner (1996) and van der Vaart (1998).

5.7 Appendix

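5.7.1 Almost Sure and $L_1$ Convergence

We say that $X_n$ converges almost surely to $X$, written $X_n \xrightarrow{as} X$, if

$$\mathbb{P}(\{s : X_n(s) \to X(s)\}) = 1.$$

We say that $X_n$ converges in $L_1$ to $X$, written $X_n \xrightarrow{L_1} X$, if

$$\mathbb{E}|X_n - X| \to 0$$

as $n \to \infty$.

5.17 Theorem. Let $X_n$ and $X$ be random variables. Then:

(a) $X_n \xrightarrow{as} X$ implies that $X_n \xrightarrow{P} X$.

(b) $X_n \xrightarrow{qm} X$ implies that $X_n \xrightarrow{L_1} X$.

(c) $X_n \xrightarrow{L_1} X$ implies that $X_n \xrightarrow{P} X$.

The weak law of large numbers says that $\overline{X}_n$ converges to $\mathbb{E}(X_1)$ in probability.
The strong law asserts that this is also true almost surely.

5.18 Theorem (The Strong Law of Large Numbers). Let $X_1, X_2, \ldots$ be IID. If
$\mathbb{E}|X_1| < \infty$ then $\overline{X}_n \xrightarrow{as} \mu$, where $\mu = \mathbb{E}(X_1)$.

A sequence $X_n$ is asymptotically uniformly integrable if

$$\lim_{M \to \infty} \limsup_{n \to \infty} \mathbb{E}\left(|X_n|\, I(|X_n| > M)\right) = 0.$$

5.19 Theorem. If $X_n \xrightarrow{P} b$ and $X_n$ is asymptotically uniformly integrable,
then $\mathbb{E}(X_n) \to b$.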

5.7.2 Proof of the Central Limit Theorem 

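Recall that if $X$ is a random variable, its moment generating function (MGF)
is $\psi_X(t) = \mathbb{E}e^{tX}$. Assume in what follows that the MGF is finite in a
neighborhood around $t = 0$.

5.20 Lemma. Let $Z_1, Z_2, \ldots$ be a sequence of random variables. Let $\psi_n$ be the
MGF of $Z_n$. Let $Z$ be another random variable and denote its MGF by $\psi$. If
$\psi_n(t) \to \psi(t)$ for all $t$ in some open interval around 0, then $Z_n \rightsquigarrow Z$.

Proof of the central limit theorem. Let $Y_i = (X_i - \mu)/\sigma$. Then,
$Z_n = n^{-1/2} \sum_i Y_i$. Let $\psi(t)$ be the MGF of $Y_i$. The MGF of $\sum_i Y_i$ is $(\psi(t))^n$
and the MGF of $Z_n$ is $[\psi(t/\sqrt{n})]^n \equiv \xi_n(t)$. Now $\psi'(0) = \mathbb{E}(Y_1) = 0$, $\psi''(0) =
\mathbb{E}(Y_1^2) = \mathbb{V}(Y_1) = 1$. So,

$$\psi(t) = \psi(0) + t\psi'(0) + \frac{t^2}{2!}\psi''(0) + \frac{t^3}{3!}\psi'''(0) + \cdots$$
$$= 1 + 0 + \frac{t^2}{2} + \frac{t^3}{3!}\psi'''(0) + \cdots$$

Now,

$$\xi_n(t) = \left[\psi\left(\frac{t}{\sqrt{n}}\right)\right]^n
= \left[1 + \frac{t^2}{2n} + \frac{t^3}{3!\, n^{3/2}}\psi'''(0) + \cdots\right]^n
= \left[1 + \frac{\frac{t^2}{2} + \frac{t^3}{3!\, n^{1/2}}\psi'''(0) + \cdots}{n}\right]^n
\to e^{t^2/2}$$

which is the MGF of a $N(0, 1)$. The result follows from the previous lemma.
In the last step we used the fact that if $a_n \to a$ then

$$\left(1 + \frac{a_n}{n}\right)^n \to e^{a}. \quad ■$$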



5.8 Exercises 

1. Let $X_1, \ldots, X_n$ be IID with finite mean $\mu = \mathbb{E}(X_1)$ and finite variance
$\sigma^2 = \mathbb{V}(X_1)$. Let $\overline{X}_n$ be the sample mean and let $S_n^2$ be the sample
variance.

(a) Show that $\mathbb{E}(S_n^2) = \sigma^2$.

(b) Show that $S_n^2 \xrightarrow{P} \sigma^2$. Hint: Show that $S_n^2 = c_n n^{-1} \sum_{i=1}^n X_i^2 - d_n \overline{X}_n^2$
where $c_n \to 1$ and $d_n \to 1$. Apply the law of large numbers to $n^{-1} \sum_{i=1}^n X_i^2$
and to $\overline{X}_n$. Then use part (e) of Theorem 5.5.

2. Let $X_1, X_2, \ldots$ be a sequence of random variables. Show that $X_n \xrightarrow{qm} b$
if and only if

$$\lim_{n \to \infty} \mathbb{E}(X_n) = b \quad \text{and} \quad \lim_{n \to \infty} \mathbb{V}(X_n) = 0.$$

3. Let $X_1, \ldots, X_n$ be IID and let $\mu = \mathbb{E}(X_1)$. Suppose that the variance is
finite. Show that $\overline{X}_n \xrightarrow{qm} \mu$.

4. Let $X_1, X_2, \ldots$ be a sequence of random variables such that

$$\mathbb{P}\left(X_n = \frac{1}{n}\right) = 1 - \frac{1}{n^2} \quad \text{and} \quad \mathbb{P}(X_n = n) = \frac{1}{n^2}.$$

Does $X_n$ converge in probability? Does $X_n$ converge in quadratic mean?



5. Let $X_1, \ldots, X_n \sim \mathrm{Bernoulli}(p)$. Prove that






6. Suppose that the height of men has mean 68 inches and standard de- 
viation 2.6 inches. We draw 100 men at random. Find (approximately) 
the probability that the average height of men in our sample will be at 
least 68 inches. 

7. Let $\lambda_n = 1/n$ for $n = 1, 2, \ldots$. Let $X_n \sim \mathrm{Poisson}(\lambda_n)$.

(a) Show that $X_n \xrightarrow{P} 0$.

(b) Let $Y_n = nX_n$. Show that



8. Suppose we have a computer program consisting of $n = 100$ pages of
code. Let $X_i$ be the number of errors on the $i$th page of code. Suppose
that the $X_i$'s are Poisson with mean 1 and that they are independent.
Let $Y = \sum_{i=1}^n X_i$ be the total number of errors. Use the central limit
theorem to approximate $\mathbb{P}(Y < 90)$.

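9. Suppose that $\mathbb{P}(X = 1) = \mathbb{P}(X = -1) = 1/2$. Define

$$X_n = \begin{cases} X & \text{with probability } 1 - \frac{1}{n} \\ e^n & \text{with probability } \frac{1}{n}. \end{cases}$$

Does $X_n$ converge to $X$ in probability? Does $X_n$ converge to $X$ in distribution?
Does $\mathbb{E}(X - X_n)^2$ converge to 0?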

10. Let $Z \sim N(0, 1)$. Let $t > 0$. Show that, for any $k > 0$,

$$\mathbb{P}(|Z| > t) \le \frac{\mathbb{E}|Z|^k}{t^k}.$$

Compare this to Mill’s inequality in Chapter 4.

11. Suppose that $X_n \sim N(0, 1/n)$ and let $X$ be a random variable with
distribution $F(x) = 0$ if $x < 0$ and $F(x) = 1$ if $x \ge 0$. Does $X_n$ converge
to $X$ in probability? (Prove or disprove). Does $X_n$ converge to $X$ in
distribution? (Prove or disprove).

12. Let $X, X_1, X_2, X_3, \ldots$ be random variables that are positive and integer
valued. Show that $X_n \rightsquigarrow X$ if and only if

$$\lim_{n \to \infty} \mathbb{P}(X_n = k) = \mathbb{P}(X = k)$$

for every integer $k$.

13. Let $Z_1, Z_2, \ldots$ be IID random variables with density $f$. Suppose that
$\mathbb{P}(Z_i > 0) = 1$ and that $\lambda = \lim_{x \downarrow 0} f(x) > 0$. Let

$$X_n = n \min\{Z_1, \ldots, Z_n\}.$$

Show that $X_n \rightsquigarrow Z$ where $Z$ has an exponential distribution with mean
$1/\lambda$.

14. Let $X_1, \ldots, X_n \sim \mathrm{Uniform}(0, 1)$. Let $Y_n = \overline{X}_n^2$. Find the limiting distribution
of $Y_n$.

15. Let

$$\begin{pmatrix} X_{11} \\ X_{21} \end{pmatrix}, \begin{pmatrix} X_{12} \\ X_{22} \end{pmatrix}, \ldots, \begin{pmatrix} X_{1n} \\ X_{2n} \end{pmatrix}$$

be IID random vectors with mean $\mu = (\mu_1, \mu_2)$ and variance $\Sigma$. Let

$$\overline{X}_1 = \frac{1}{n} \sum_{i=1}^n X_{1i}, \qquad \overline{X}_2 = \frac{1}{n} \sum_{i=1}^n X_{2i}$$

and define $Y_n = \overline{X}_1 / \overline{X}_2$. Find the limiting distribution of $Y_n$.

16. Construct an example where $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow Y$ but $X_n + Y_n$ does
not converge in distribution to $X + Y$.

6

Models, Statistical Inference and 
Learning 



6.1 Introduction 

Statistical inference, or “learning” as it is called in computer science, is the 
process of using data to infer the distribution that generated the data. A 
typical statistical inference question is: 

Given a sample $X_1, \ldots, X_n \sim F$, how do we infer $F$?

In some cases, we may want to infer only some feature of F such as its 
mean. 



6.2 Parametric and Nonparametric Models 

A statistical model $\mathfrak{F}$ is a set of distributions (or densities or regression
functions). A parametric model is a set $\mathfrak{F}$ that can be parameterized by a
finite number of parameters. For example, if we assume that the data come
from a Normal distribution, then the model is

$$\mathfrak{F} = \left\{ f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{1}{2\sigma^2}(x - \mu)^2 \right\},\ \mu \in \mathbb{R},\ \sigma > 0 \right\}. \qquad (6.1)$$

This is a two-parameter model. We have written the density as $f(x; \mu, \sigma)$ to
show that $x$ is a value of the random variable whereas $\mu$ and $\sigma$ are parameters.
In general, a parametric model takes the form

$$\mathfrak{F} = \left\{ f(x; \theta) : \theta \in \Theta \right\} \qquad (6.2)$$

where $\theta$ is an unknown parameter (or vector of parameters) that can take
values in the parameter space $\Theta$. If $\theta$ is a vector but we are only interested in
one component of $\theta$, we call the remaining parameters nuisance parameters.
A nonparametric model is a set $\mathfrak{F}$ that cannot be parameterized by a finite
number of parameters. For example, $\mathfrak{F}_{\mathrm{ALL}} = \{\text{all CDFs}\}$ is nonparametric.^1

6.1 Example (One-dimensional Parametric Estimation). Let $X_1, \ldots, X_n$ be
independent Bernoulli($p$) observations. The problem is to estimate the
parameter $p$. ■

6.2 Example (Two-dimensional Parametric Estimation). Suppose that $X_1, \ldots,
X_n \sim F$ and we assume that the PDF $f \in \mathfrak{F}$ where $\mathfrak{F}$ is given in (6.1). In
this case there are two parameters, $\mu$ and $\sigma$. The goal is to estimate the
parameters from the data. If we are only interested in estimating $\mu$, then $\mu$ is
the parameter of interest and $\sigma$ is a nuisance parameter. ■

6.3 Example (Nonparametric estimation of the CDF). Let $X_1, \ldots, X_n$ be
independent observations from a CDF $F$. The problem is to estimate $F$ assuming
only that $F \in \mathfrak{F}_{\mathrm{ALL}} = \{\text{all CDFs}\}$. ■

6.4 Example (Nonparametric density estimation). Let $X_1, \ldots, X_n$ be independent
observations from a CDF $F$ and let $f = F'$ be the PDF. Suppose we want
to estimate the PDF $f$. It is not possible to estimate $f$ assuming only that
$F \in \mathfrak{F}_{\mathrm{ALL}}$. We need to assume some smoothness on $f$. For example, we might
assume that $f \in \mathfrak{F} = \mathfrak{F}_{\mathrm{DENS}} \cap \mathfrak{F}_{\mathrm{SOB}}$ where $\mathfrak{F}_{\mathrm{DENS}}$ is the set of all probability
density functions and

$$\mathfrak{F}_{\mathrm{SOB}} = \left\{ f : \int (f''(x))^2\, dx < \infty \right\}.$$

The class $\mathfrak{F}_{\mathrm{SOB}}$ is called a Sobolev space; it is the set of functions that are
not “too wiggly.” ■

6.5 Example (Nonparametric estimation of functionals). Let $X_1, \ldots, X_n \sim F$.
Suppose we want to estimate $\mu = \mathbb{E}(X_1) = \int x\, dF(x)$ assuming only that
$\mu$ exists. The mean $\mu$ may be thought of as a function of $F$: we can write
$\mu = T(F) = \int x\, dF(x)$. In general, any function of $F$ is called a statistical
functional. Other examples of functionals are the variance $T(F) =
\int x^2\, dF(x) - \left(\int x\, dF(x)\right)^2$ and the median $T(F) = F^{-1}(1/2)$. ■

^1 The distinction between parametric and nonparametric is more subtle than this but we don’t
need a rigorous definition for our purposes.

6.6 Example (Regression, prediction, and classification). Suppose we observe pairs
of data $(X_1, Y_1), \ldots, (X_n, Y_n)$. Perhaps $X_i$ is the blood pressure of subject $i$
and $Y_i$ is how long they live. $X$ is called a predictor or regressor or feature
or independent variable. $Y$ is called the outcome or the response
variable or the dependent variable. We call $r(x) = \mathbb{E}(Y | X = x)$ the regression
function. If we assume that $r \in \mathfrak{F}$ where $\mathfrak{F}$ is finite dimensional
(the set of straight lines, for example) then we have a parametric regression
model. If we assume that $r \in \mathfrak{F}$ where $\mathfrak{F}$ is not finite dimensional then
we have a nonparametric regression model. The goal of predicting $Y$ for
a new patient based on their $X$ value is called prediction. If $Y$ is discrete
(for example, live or die) then prediction is instead called classification. If
our goal is to estimate the function $r$, then we call this regression or curve
estimation. Regression models are sometimes written as

$$Y = r(X) + \epsilon \qquad (6.3)$$

where $\mathbb{E}(\epsilon) = 0$. We can always rewrite a regression model this way. To see
this, define $\epsilon = Y - r(X)$ and hence $Y = Y + r(X) - r(X) = r(X) + \epsilon$.
Moreover, $\mathbb{E}(\epsilon) = \mathbb{E}\mathbb{E}(\epsilon | X) = \mathbb{E}(\mathbb{E}(Y - r(X) | X)) = \mathbb{E}(\mathbb{E}(Y | X) - r(X)) =
\mathbb{E}(r(X) - r(X)) = 0$. ■

What’s Next? It is traditional in most introductory courses to start with 
parametric inference. Instead, we will start with nonparametric inference and 
then we will cover parametric inference. In some respects, nonparametric in- 
ference is easier to understand and is more useful than parametric inference. 

Frequentists and Bayesians. There are many approaches to statistical 
inference. The two dominant approaches are called frequentist inference 
and Bayesian inference. We’ll cover both but we will start with frequentist 
inference. We’ll postpone a discussion of the pros and cons of these two until 
later. 

Some Notation. If $\mathfrak{F} = \{f(x; \theta) : \theta \in \Theta\}$ is a parametric model, we write
$\mathbb{P}_\theta(X \in A) = \int_A f(x; \theta)\, dx$ and $\mathbb{E}_\theta(r(X)) = \int r(x) f(x; \theta)\, dx$. The subscript $\theta$
indicates that the probability or expectation is with respect to $f(x; \theta)$; it does
not mean we are averaging over $\theta$. Similarly, we write $\mathbb{V}_\theta$ for the variance.

6.3 Fundamental Concepts in Inference

Many inferential problems can be identified as being one of three types: es- 
timation, confidence sets, or hypothesis testing. We will treat all of these 
problems in detail in the rest of the book. Here, we give a brief introduction 
to the ideas. 

6.3.1 Point Estimation 

Point estimation refers to providing a single “best guess” of some quantity
of interest. The quantity of interest could be a parameter in a parametric
model, a CDF $F$, a probability density function $f$, a regression function $r$, or
a prediction for a future value $Y$ of some random variable.

By convention, we denote a point estimate of $\theta$ by $\hat{\theta}$ or $\hat{\theta}_n$. Remember
that $\theta$ is a fixed, unknown quantity. The estimate $\hat{\theta}$ depends on the
data so $\hat{\theta}$ is a random variable.

More formally, let $X_1, \ldots, X_n$ be $n$ IID data points from some distribution
$F$. A point estimator $\hat{\theta}_n$ of a parameter $\theta$ is some function of $X_1, \ldots, X_n$:

$$\hat{\theta}_n = g(X_1, \ldots, X_n).$$

The bias of an estimator is defined by

$$\mathrm{bias}(\hat{\theta}_n) = \mathbb{E}_\theta(\hat{\theta}_n) - \theta. \qquad (6.4)$$

We say that $\hat{\theta}_n$ is unbiased if $\mathbb{E}(\hat{\theta}_n) = \theta$. Unbiasedness used to receive much
attention but these days is considered less important; many of the estimators
we will use are biased. A reasonable requirement for an estimator is that it
should converge to the true parameter value as we collect more and more
data. This requirement is quantified by the following definition:



6.7 Definition. A point estimator $\hat{\theta}_n$ of a parameter $\theta$ is consistent if
$\hat{\theta}_n \xrightarrow{P} \theta$.

The distribution of $\hat{\theta}_n$ is called the sampling distribution. The standard
deviation of $\hat{\theta}_n$ is called the standard error, denoted by se:

$$\mathrm{se} = \mathrm{se}(\hat{\theta}_n) = \sqrt{\mathbb{V}(\hat{\theta}_n)}.$$

Often, the standard error depends on the unknown $F$. In those cases, se is
an unknown quantity but we usually can estimate it. The estimated standard
error is denoted by $\widehat{\mathrm{se}}$.

6.8 Example. Let $X_1, \ldots, X_n \sim \mathrm{Bernoulli}(p)$ and let $\hat{p}_n = n^{-1} \sum_i X_i$. Then
$\mathbb{E}(\hat{p}_n) = n^{-1} \sum_i \mathbb{E}(X_i) = p$ so $\hat{p}_n$ is unbiased. The standard error is $\mathrm{se} =
\sqrt{\mathbb{V}(\hat{p}_n)} = \sqrt{p(1 - p)/n}$. The estimated standard error is $\widehat{\mathrm{se}} = \sqrt{\hat{p}(1 - \hat{p})/n}$. ■
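The difference between the standard error and the estimated standard error in Example 6.8 is simply whether the unknown $p$ or the estimate $\hat{p}_n$ is plugged into the formula. The sketch below simulates one data set to show both; the true value `p_true`, the sample size, and the seed are hypothetical choices made for illustration.

```python
import math
import random

random.seed(1)
p_true, n = 0.3, 500   # hypothetical true parameter and sample size

# One simulated Bernoulli(p_true) data set.
xs = [1 if random.random() < p_true else 0 for _ in range(n)]

p_hat = sum(xs) / n
se_true = math.sqrt(p_true * (1 - p_true) / n)   # requires the unknown p
se_hat = math.sqrt(p_hat * (1 - p_hat) / n)      # plug-in estimate from the data

print(p_hat, se_true, se_hat)
```

In practice only `p_hat` and `se_hat` are available, since `se_true` depends on the parameter being estimated.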



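The quality of a point estimate is sometimes assessed by the mean squared
error, or MSE, defined by

$$\mathrm{MSE} = \mathbb{E}_\theta(\hat{\theta}_n - \theta)^2. \qquad (6.6)$$

Keep in mind that $\mathbb{E}_\theta(\cdot)$ refers to expectation with respect to the distribution

$$f(x_1, \ldots, x_n; \theta) = \prod_{i=1}^n f(x_i; \theta)$$

that generated the data. It does not mean we are averaging over a distribution
for $\theta$.

6.9 Theorem. The MSE can be written as

$$\mathrm{MSE} = \mathrm{bias}^2(\hat{\theta}_n) + \mathbb{V}_\theta(\hat{\theta}_n). \qquad (6.7)$$

Proof. Let $\overline{\theta}_n = \mathbb{E}_\theta(\hat{\theta}_n)$. Then

$$\mathbb{E}_\theta(\hat{\theta}_n - \theta)^2 = \mathbb{E}_\theta(\hat{\theta}_n - \overline{\theta}_n + \overline{\theta}_n - \theta)^2$$
$$= \mathbb{E}_\theta(\hat{\theta}_n - \overline{\theta}_n)^2 + 2(\overline{\theta}_n - \theta)\,\mathbb{E}_\theta(\hat{\theta}_n - \overline{\theta}_n) + \mathbb{E}_\theta(\overline{\theta}_n - \theta)^2$$
$$= (\overline{\theta}_n - \theta)^2 + \mathbb{E}_\theta(\hat{\theta}_n - \overline{\theta}_n)^2$$
$$= \mathrm{bias}^2(\hat{\theta}_n) + \mathbb{V}(\hat{\theta}_n)$$

where we have used the fact that $\mathbb{E}_\theta(\hat{\theta}_n - \overline{\theta}_n) = \overline{\theta}_n - \overline{\theta}_n = 0$. ■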
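The decomposition in Theorem 6.9 can be verified exactly for a simple estimator by summing over the Binomial distribution of the number of successes. The sketch below is illustrative: the estimator $(S + 1)/(n + 2)$ is a deliberately biased estimator of a Bernoulli $p$ chosen here for the example (it does not appear in the text), and the values $n = 20$, $p = 0.3$ and the function names are likewise assumptions.

```python
import math


def binomial_pmf(k, n, p):
    """P(S = k) for S ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def moments(estimator, n, p):
    """Exact bias, variance and MSE of estimator(S, n) where S ~ Binomial(n, p)."""
    pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
    values = [estimator(k, n) for k in range(n + 1)]
    mean = sum(w * v for w, v in zip(pmf, values))
    variance = sum(w * (v - mean) ** 2 for w, v in zip(pmf, values))
    mse = sum(w * (v - p) ** 2 for w, v in zip(pmf, values))
    return mean - p, variance, mse


def shrunk(s, n):
    """A deliberately biased estimator of p, used only for illustration."""
    return (s + 1) / (n + 2)


bias, variance, mse = moments(shrunk, n=20, p=0.3)

# The two printed numbers agree up to floating-point rounding.
print(mse, bias**2 + variance)
```

Replacing `shrunk` with the sample proportion `lambda s, n: s / n` gives zero bias, in which case the MSE reduces to the variance $p(1 - p)/n$.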

6.10 Theorem. If bias $\to 0$ and se $\to 0$ as $n \to \infty$ then $\hat{\theta}_n$ is consistent, that
is, $\hat{\theta}_n \xrightarrow{P} \theta$.

Proof. If bias $\to 0$ and se $\to 0$ then, by Theorem 6.9, MSE $\to 0$. It
follows that $\hat{\theta}_n \xrightarrow{qm} \theta$. (Recall Definition 5.2.) The result follows from part (a)
of Theorem 5.4. ■

6.11 Example. Returning to the coin flipping example, we have that $\mathbb{E}_p(\hat{p}_n) =
p$ so the bias $= p - p = 0$ and $\mathrm{se} = \sqrt{p(1 - p)/n} \to 0$. Hence, $\hat{p}_n \xrightarrow{P} p$, that is,
$\hat{p}_n$ is a consistent estimator. ■

Many of the estimators we will encounter will turn out to have, approximately,
a Normal distribution.

6.12 Definition. An estimator is asymptotically Normal if

$$\frac{\hat{\theta}_n - \theta}{\mathrm{se}} \rightsquigarrow N(0, 1). \qquad (6.8)$$



6.3.2 Confidence Sets 

A $1 - \alpha$ confidence interval for a parameter $\theta$ is an interval $C_n = (a, b)$
where $a = a(X_1, \ldots, X_n)$ and $b = b(X_1, \ldots, X_n)$ are functions of the data
such that

$$\mathbb{P}_\theta(\theta \in C_n) \ge 1 - \alpha, \quad \text{for all } \theta \in \Theta. \qquad (6.9)$$

In words, $(a, b)$ traps $\theta$ with probability $1 - \alpha$. We call $1 - \alpha$ the coverage of
the confidence interval.

Warning! $C_n$ is random and $\theta$ is fixed.

Commonly, people use 95 percent confidence intervals, which corresponds
to choosing $\alpha = 0.05$. If $\theta$ is a vector then we use a confidence set (such as
a sphere or an ellipse) instead of an interval.

Warning! There is much confusion about how to interpret a confidence
interval. A confidence interval is not a probability statement about $\theta$ since $\theta$
is a fixed quantity, not a random variable. Some texts interpret confidence
intervals as follows: if I repeat the experiment over and over, the interval will
contain the parameter 95 percent of the time. This is correct but useless since
we rarely repeat the same experiment over and over. A better interpretation
is this:

On day 1, you collect data and construct a 95 percent confidence
interval for a parameter $\theta_1$. On day 2, you collect new data and construct
a 95 percent confidence interval for an unrelated parameter $\theta_2$.
On day 3, you collect new data and construct a 95 percent confidence
interval for an unrelated parameter $\theta_3$. You continue this way
constructing confidence intervals for a sequence of unrelated parameters
$\theta_1, \theta_2, \ldots$ Then 95 percent of your intervals will trap the true
parameter value. There is no need to introduce the idea of repeating
the same experiment over and over.

6.13 Example. Every day, newspapers report opinion polls. For example, they
might say that “83 percent of the population favor arming pilots with guns.”
Usually, you will see a statement like “this poll is accurate to within 4 points
95 percent of the time.” They are saying that $83 \pm 4$ is a 95 percent confidence
interval for the true but unknown proportion $p$ of people who favor arming
pilots with guns. If you form a confidence interval this way every day for the
rest of your life, 95 percent of your intervals will contain the true parameter.
This is true even though you are estimating a different quantity (a different
poll question) every day. ■

6.14 Example. The fact that a confidence interval is not a probability
statement about $\theta$ is confusing. Consider this example from Berger and Wolpert
(1984). Let $\theta$ be a fixed, unknown real number and let $X_1, X_2$ be independent
random variables such that $\mathbb{P}(X_i = 1) = \mathbb{P}(X_i = -1) = 1/2$. Now define
$Y_i = \theta + X_i$ and suppose that you only observe $Y_1$ and $Y_2$. Define the
following “confidence interval” which actually only contains one point:

$$C = \begin{cases} \{Y_1 - 1\} & \text{if } Y_1 = Y_2 \\ \{(Y_1 + Y_2)/2\} & \text{if } Y_1 \ne Y_2. \end{cases}$$

You can check that, no matter what $\theta$ is, we have $\mathbb{P}_\theta(\theta \in C) = 3/4$ so this
is a 75 percent confidence interval. Suppose we now do the experiment and
we get $Y_1 = 15$ and $Y_2 = 17$. Then our 75 percent confidence interval is $\{16\}$.
However, we are certain that $\theta = 16$. If you wanted to make a probability
statement about $\theta$ you would probably say that $\mathbb{P}(\theta \in C \mid Y_1, Y_2) = 1$. There is
nothing wrong with saying that $\{16\}$ is a 75 percent confidence interval. But
it is not a probability statement about $\theta$. ■
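The 3/4 coverage claimed in Example 6.14 can be checked by simulation. The following is a minimal sketch, not part of the original text: the function names, the seed, and the number of trials are illustrative assumptions, and $\theta$ is taken to be an integer so that equality comparisons are exact.

```python
import random

def interval_point(y1, y2):
    """Return the single point contained in the set C of Example 6.14."""
    if y1 == y2:
        return y1 - 1
    return (y1 + y2) / 2

def estimate_coverage(theta, n_trials=100_000, seed=0):
    """Estimate P_theta(theta in C) by repeating the experiment n_trials times."""
    rng = random.Random(seed)  # fixed seed is an arbitrary choice for reproducibility
    hits = 0
    for _ in range(n_trials):
        y1 = theta + rng.choice([-1, 1])
        y2 = theta + rng.choice([-1, 1])
        if interval_point(y1, y2) == theta:
            hits += 1
    return hits / n_trials

coverage = estimate_coverage(theta=16)
print(coverage)  # should be close to 3/4
```

The estimate is a long-run frequency over repeated experiments; it says nothing about the single realized set $\{16\}$, which is the point of the example.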



In Chapter 11 we will discuss Bayesian methods in which we treat $\theta$ as if it
were a random variable and we do make probability statements about $\theta$. In
particular, we will make statements like “the probability that $\theta$ is in $C_n$, given
the data, is 95 percent.” However, these Bayesian intervals refer to
degree-of-belief probabilities. These Bayesian intervals will not, in general, trap the
parameter 95 percent of the time.

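6.15 Example. In the coin flipping setting, let $C_n = (\hat{p}_n - \epsilon_n, \hat{p}_n + \epsilon_n)$ where
$\epsilon_n^2 = \log(2/\alpha)/(2n)$. From Hoeffding’s inequality (4.4) it follows that

$$\mathbb{P}(p \in C_n) \ge 1 - \alpha$$

for every $p$. Hence, $C_n$ is a $1 - \alpha$ confidence interval. ■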

As mentioned earlier, point estimators often have a limiting Normal
distribution, meaning that equation (6.8) holds, that is, $\hat{\theta}_n \approx N(\theta, \widehat{\mathsf{se}}^2)$. In this
case we can construct (approximate) confidence intervals as follows.
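6.16 Theorem (Normal-based Confidence Interval). Suppose that $\hat{\theta}_n \approx N(\theta, \widehat{\mathsf{se}}^2)$.
Let $\Phi$ be the CDF of a standard Normal and let $z_{\alpha/2} = \Phi^{-1}(1 - (\alpha/2))$, that
is, $\mathbb{P}(Z > z_{\alpha/2}) = \alpha/2$ and $\mathbb{P}(-z_{\alpha/2} < Z < z_{\alpha/2}) = 1 - \alpha$ where $Z \sim N(0, 1)$.
Let

$$C_n = (\hat{\theta}_n - z_{\alpha/2}\,\widehat{\mathsf{se}},\ \hat{\theta}_n + z_{\alpha/2}\,\widehat{\mathsf{se}}). \tag{6.10}$$

Then

$$\mathbb{P}_\theta(\theta \in C_n) \to 1 - \alpha. \tag{6.11}$$

Proof. Let $Z_n = (\hat{\theta}_n - \theta)/\widehat{\mathsf{se}}$. By assumption $Z_n \rightsquigarrow Z$ where $Z \sim N(0, 1)$.
Hence,

$$\begin{aligned}
\mathbb{P}_\theta(\theta \in C_n) &= \mathbb{P}_\theta\left(\hat{\theta}_n - z_{\alpha/2}\,\widehat{\mathsf{se}} < \theta < \hat{\theta}_n + z_{\alpha/2}\,\widehat{\mathsf{se}}\right) \\
&= \mathbb{P}_\theta\left(-z_{\alpha/2} < \frac{\hat{\theta}_n - \theta}{\widehat{\mathsf{se}}} < z_{\alpha/2}\right) \\
&\to \mathbb{P}\left(-z_{\alpha/2} < Z < z_{\alpha/2}\right) \\
&= 1 - \alpha. \qquad ■
\end{aligned}$$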



For 95 percent confidence intervals, $\alpha = 0.05$ and $z_{\alpha/2} = 1.96 \approx 2$ leading
to the approximate 95 percent confidence interval $\hat{\theta}_n \pm 2\,\widehat{\mathsf{se}}$.

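6.17 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ and let $\hat{p}_n = n^{-1}\sum_{i=1}^n X_i$.
Then $\mathbb{V}(\hat{p}_n) = n^{-2}\sum_{i=1}^n \mathbb{V}(X_i) = n^{-2}\sum_{i=1}^n p(1-p) = n^{-2}\,n\,p(1-p) = p(1-p)/n$.
Hence, $\mathsf{se} = \sqrt{p(1-p)/n}$ and $\widehat{\mathsf{se}} = \sqrt{\hat{p}_n(1-\hat{p}_n)/n}$. By the Central
Limit Theorem, $\hat{p}_n \approx N(p, \widehat{\mathsf{se}}^2)$. Therefore, an approximate $1 - \alpha$ confidence
interval is

$$\hat{p}_n \pm z_{\alpha/2}\,\widehat{\mathsf{se}} = \hat{p}_n \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_n(1-\hat{p}_n)}{n}}.$$

Compare this with the confidence interval in Example 6.15. The Normal-based
interval is shorter but it only has approximately (large sample) correct
coverage. ■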
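To make the comparison between Examples 6.15 and 6.17 concrete, here is a small sketch that computes both intervals for the same data summary. The function names and the illustrative values $\hat{p}_n = 0.5$, $n = 100$ are assumptions made for this illustration, not values from the text.

```python
import math
from statistics import NormalDist

def hoeffding_interval(p_hat, n, alpha=0.05):
    """Interval of Example 6.15: p_hat +/- eps_n with eps_n^2 = log(2/alpha)/(2n)."""
    eps = math.sqrt(math.log(2 / alpha) / (2 * n))
    return (p_hat - eps, p_hat + eps)

def normal_interval(p_hat, n, alpha=0.05):
    """Interval of Example 6.17: p_hat +/- z_{alpha/2} * sqrt(p_hat(1-p_hat)/n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se_hat = math.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - z * se_hat, p_hat + z * se_hat)

h = hoeffding_interval(0.5, 100)
g = normal_interval(0.5, 100)
print("Hoeffding half-width:", (h[1] - h[0]) / 2)
print("Normal half-width:   ", (g[1] - g[0]) / 2)
```

The Hoeffding interval is wider because its coverage guarantee holds for every $n$, whereas the Normal-based interval is only asymptotically valid.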

6.3.3 Hypothesis Testing

In hypothesis testing, we start with some default theory, called a null
hypothesis, and we ask if the data provide sufficient evidence to reject the
theory. If not we retain the null hypothesis.¹

¹The term “retaining the null hypothesis” is due to Chris Genovese. Other terminology is
“accepting the null” or “failing to reject the null.”
6.18 Example (Testing if a Coin is Fair). Let

$$X_1, \ldots, X_n \sim \text{Bernoulli}(p)$$

be $n$ independent coin flips. Suppose we want to test if the coin is fair. Let $H_0$
denote the hypothesis that the coin is fair and let $H_1$ denote the hypothesis
that the coin is not fair. $H_0$ is called the null hypothesis and $H_1$ is called
the alternative hypothesis. We can write the hypotheses as

$$H_0 : p = 1/2 \quad \text{versus} \quad H_1 : p \ne 1/2.$$

It seems reasonable to reject $H_0$ if $T = |\hat{p}_n - (1/2)|$ is large. When we discuss
hypothesis testing in detail, we will be more precise about how large $T$ should
be to reject $H_0$. ■

6.4 Bibliographic Remarks 

Statistical inference is covered in many texts. Elementary texts include
DeGroot and Schervish (2002) and Larsen and Marx (1986). At the intermediate
level I recommend Casella and Berger (2002), Bickel and Doksum (2000), and 
Rice (1995). At the advanced level, Cox and Hinkley (2000), Lehmann and 
Casella (1998), Lehmann (1986), and van der Vaart (1998). 

6.5 Appendix 

Our definition of confidence interval requires that $\mathbb{P}_\theta(\theta \in C_n) \ge 1 - \alpha$
for all $\theta \in \Theta$. A pointwise asymptotic confidence interval requires that
$\liminf_{n\to\infty} \mathbb{P}_\theta(\theta \in C_n) \ge 1 - \alpha$ for all $\theta \in \Theta$. A uniform asymptotic
confidence interval requires that $\liminf_{n\to\infty} \inf_{\theta\in\Theta} \mathbb{P}_\theta(\theta \in C_n) \ge 1 - \alpha$. The
approximate Normal-based interval is a pointwise asymptotic confidence
interval.



6.6 Exercises 

1. Let $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$ and let $\hat{\lambda} = n^{-1}\sum_{i=1}^n X_i$. Find the bias,
se, and MSE of this estimator.

2. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$ and let $\hat{\theta} = \max\{X_1, \ldots, X_n\}$. Find the
bias, se, and MSE of this estimator.

3. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$ and let $\hat{\theta} = 2\overline{X}_n$. Find the bias, se, and
MSE of this estimator.




7 

Estimating the CDF and Statistical 
Functionals 



The first inference problem we will consider is nonparametric estimation of the
CDF $F$. Then we will estimate statistical functionals, which are functions of
the CDF, such as the mean, the variance, and the correlation. The nonparametric
method for estimating functionals is called the plug-in method.



7.1 The Empirical Distribution Function 

Let $X_1, \ldots, X_n \sim F$ be an IID sample where $F$ is a distribution function on
the real line. We will estimate $F$ with the empirical distribution function,
which is defined as follows.

7.1 Definition. The empirical distribution function $\hat{F}_n$ is the CDF
that puts mass $1/n$ at each data point $X_i$. Formally,

$$\hat{F}_n(x) = \frac{\sum_{i=1}^n I(X_i \le x)}{n} \tag{7.1}$$

where

$$I(X_i \le x) = \begin{cases} 1 & \text{if } X_i \le x \\ 0 & \text{if } X_i > x. \end{cases}$$
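Definition 7.1 translates directly into code. The sketch below is one possible implementation, assuming the data are held in a Python list; the name `make_ecdf` and the toy data are illustrative only.

```python
from bisect import bisect_right

def make_ecdf(data):
    """Return the empirical distribution function F_hat_n of Definition 7.1."""
    xs = sorted(data)
    n = len(xs)

    def F_hat(x):
        # bisect_right counts the data points X_i with X_i <= x
        return bisect_right(xs, x) / n

    return F_hat

F_hat = make_ecdf([0.2, 0.5, 0.5, 1.1])
print(F_hat(0.1), F_hat(0.5), F_hat(2.0))
```

Sorting once and using binary search makes each evaluation cost $O(\log n)$ instead of rescanning the sample.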
FIGURE 7.1. Nerve data. Each vertical line represents one data point. The solid
line is the empirical distribution function. The lines above and below the middle
line are a 95 percent confidence band.

7.2 Example (Nerve Data). Cox and Lewis (1966) reported 799 waiting times
between successive pulses along a nerve fiber. Figure 7.1 shows the empirical
CDF $\hat{F}_n$. The data points are shown as small vertical lines at the bottom of
the plot. Suppose we want to estimate the fraction of waiting times between
.4 and .6 seconds. The estimate is $\hat{F}_n(.6) - \hat{F}_n(.4) = .93 - .84 = .09$. ■




¹More precisely, $\sup_x |\hat{F}_n(x) - F(x)|$ converges to 0 almost surely.
7.5 Theorem (The Dvoretzky-Kiefer-Wolfowitz (DKW) Inequality). Let $X_1, \ldots, X_n \sim F$.
Then, for any $\epsilon > 0$,

$$\mathbb{P}\left(\sup_x |F(x) - \hat{F}_n(x)| > \epsilon\right) \le 2e^{-2n\epsilon^2}. \tag{7.2}$$


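From the DKW inequality, we can construct a confidence set as follows:

A Nonparametric $1 - \alpha$ Confidence Band for $F$

Define,

$$L(x) = \max\{\hat{F}_n(x) - \epsilon_n,\ 0\}$$
$$U(x) = \min\{\hat{F}_n(x) + \epsilon_n,\ 1\}$$

where $\epsilon_n = \sqrt{\frac{1}{2n}\log\left(\frac{2}{\alpha}\right)}$.

It follows from (7.2) that for any $F$,

$$\mathbb{P}\Big(L(x) \le F(x) \le U(x) \ \text{ for all } x\Big) \ge 1 - \alpha. \tag{7.3}$$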




7.6 Example. The dashed lines in Figure 7.1 give a 95 percent confidence
band using $\epsilon_n = \sqrt{\frac{1}{2n}\log\left(\frac{2}{.05}\right)} = .048$. ■
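The band is simple to compute. The following sketch evaluates $\epsilon_n$ and the band $(L(x), U(x))$ at a single point; the function names and the toy data are assumptions for illustration, and only the sample size $n = 799$ is taken from the nerve-data example.

```python
import math
from bisect import bisect_right

def dkw_epsilon(n, alpha=0.05):
    """Half-width eps_n = sqrt(log(2/alpha) / (2n)) of the DKW confidence band."""
    return math.sqrt(math.log(2 / alpha) / (2 * n))

def dkw_band(data, x, alpha=0.05):
    """Return (L(x), U(x)) for the 1 - alpha confidence band at the point x."""
    xs = sorted(data)
    n = len(xs)
    f_hat = bisect_right(xs, x) / n          # empirical CDF at x
    eps = dkw_epsilon(n, alpha)
    return max(f_hat - eps, 0.0), min(f_hat + eps, 1.0)

print(round(dkw_epsilon(799), 3))            # half-width for n = 799, as in Example 7.6
lower, upper = dkw_band([0.1, 0.4, 0.4, 0.9, 1.3], x=0.4)
print(lower, upper)
```

Note that $\epsilon_n$ depends only on $n$ and $\alpha$, so the band has the same half-width at every $x$ before clipping to $[0, 1]$.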

7.2 Statistical Functionals 

A statistical functional $T(F)$ is any function of $F$. Examples are the mean
$\mu = \int x\,dF(x)$, the variance $\sigma^2 = \int (x - \mu)^2\,dF(x)$ and the median
$m = F^{-1}(1/2)$.

7.7 Definition. The plug-in estimator of $\theta = T(F)$ is defined by

$$\hat{\theta}_n = T(\hat{F}_n).$$

In other words, just plug in $\hat{F}_n$ for the unknown $F$.
The reason $T(F) = \int r(x)\,dF(x)$ is called a linear functional is because $T$
satisfies

$$T(aF + bG) = aT(F) + bT(G),$$

hence $T$ is linear in its arguments. Recall that $\int r(x)\,dF(x)$ is defined to be
$\int r(x) f(x)\,dx$ in the continuous case and $\sum_j r(x_j) f(x_j)$ in the discrete case. The
empirical CDF $\hat{F}_n(x)$ is discrete, putting mass $1/n$ at each $X_i$. Hence, if
$T(F) = \int r(x)\,dF(x)$ is a linear functional then we have:

7.9 Theorem. The plug-in estimator for the linear functional
$T(F) = \int r(x)\,dF(x)$ is:

$$T(\hat{F}_n) = \int r(x)\,d\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n r(X_i). \tag{7.4}$$



Sometimes we can find the estimated standard error $\widehat{\mathsf{se}}$ of $T(\hat{F}_n)$ by doing
some calculations. However, in other cases it is not obvious how to estimate
the standard error. In the next chapter, we will discuss a general method for
finding $\widehat{\mathsf{se}}$. For now, let us just assume that somehow we can find $\widehat{\mathsf{se}}$.
In many cases, it turns out that

$$T(\hat{F}_n) \approx N(T(F), \widehat{\mathsf{se}}^2). \tag{7.5}$$

By equation (6.11), an approximate $1 - \alpha$ confidence interval for $T(F)$ is then

$$T(\hat{F}_n) \pm z_{\alpha/2}\,\widehat{\mathsf{se}}. \tag{7.6}$$

We will call this the Normal-based interval. For a 95 percent confidence
interval, $z_{\alpha/2} = z_{.05/2} = 1.96 \approx 2$ so the interval is

$$T(\hat{F}_n) \pm 2\,\widehat{\mathsf{se}}.$$

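7.10 Example (The mean). Let $\mu = T(F) = \int x\,dF(x)$. The plug-in estimator
is $\hat{\mu} = \int x\,d\hat{F}_n(x) = \overline{X}_n$. The standard error is $\mathsf{se} = \sqrt{\mathbb{V}(\overline{X}_n)} = \sigma/\sqrt{n}$. If
$\hat{\sigma}$ denotes an estimate of $\sigma$, then the estimated standard error is $\hat{\sigma}/\sqrt{n}$. (In
the next example, we shall see how to estimate $\sigma$.) A Normal-based confidence
interval for $\mu$ is $\overline{X}_n \pm z_{\alpha/2}\,\widehat{\mathsf{se}}$. ■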

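7.11 Example (The Variance). Let $\sigma^2 = T(F) = \mathbb{V}(X) = \int x^2\,dF(x) - \left(\int x\,dF(x)\right)^2$.
The plug-in estimator is

$$\begin{aligned}
\hat{\sigma}^2 &= \int x^2\,d\hat{F}_n(x) - \left(\int x\,d\hat{F}_n(x)\right)^2 \\
&= \frac{1}{n}\sum_{i=1}^n X_i^2 - \left(\frac{1}{n}\sum_{i=1}^n X_i\right)^2 \\
&= \frac{1}{n}\sum_{i=1}^n (X_i - \overline{X}_n)^2.
\end{aligned}$$

Another reasonable estimator of $\sigma^2$ is the sample variance

$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X}_n)^2.$$

In practice, there is little difference between $\hat{\sigma}^2$ and $S_n^2$ and you can use either
one. Returning to the last example, we now see that the estimated standard
error of the estimate of the mean is $\widehat{\mathsf{se}} = \hat{\sigma}/\sqrt{n}$. ■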



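7.12 Example (The Skewness). Let $\mu$ and $\sigma^2$ denote the mean and variance
of a random variable $X$. The skewness is defined to be

$$\kappa = \frac{\mathbb{E}(X - \mu)^3}{\sigma^3} = \frac{\int (x - \mu)^3\,dF(x)}{\left\{\int (x - \mu)^2\,dF(x)\right\}^{3/2}}.$$

The skewness measures the lack of symmetry of a distribution. To find the
plug-in estimate, first recall that $\hat{\mu} = n^{-1}\sum_i X_i$ and $\hat{\sigma}^2 = n^{-1}\sum_i (X_i - \hat{\mu})^2$.
The plug-in estimate of $\kappa$ is

$$\hat{\kappa} = \frac{\int (x - \hat{\mu})^3\,d\hat{F}_n(x)}{\left\{\int (x - \hat{\mu})^2\,d\hat{F}_n(x)\right\}^{3/2}} = \frac{\frac{1}{n}\sum_i (X_i - \hat{\mu})^3}{\hat{\sigma}^3}. \quad ■$$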
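The plug-in estimates of the mean, variance, and skewness from Examples 7.10 to 7.12 are all sample averages, so they can be computed in a few lines. This is a minimal sketch; the function name and the toy data are illustrative assumptions.

```python
def plug_in_moments(data):
    """Return the plug-in estimates (mu_hat, sigma2_hat, kappa_hat)."""
    n = len(data)
    mu_hat = sum(data) / n
    # Plug-in variance divides by n (not n - 1), as in Example 7.11
    sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n
    third_moment = sum((x - mu_hat) ** 3 for x in data) / n
    kappa_hat = third_moment / sigma2_hat ** 1.5
    return mu_hat, sigma2_hat, kappa_hat

mu_hat, sigma2_hat, kappa_hat = plug_in_moments([1, 2, 3, 4, 10])
print(mu_hat, sigma2_hat, kappa_hat)
```

For this toy sample the single large value 10 pulls the skewness estimate above zero, while a symmetric sample such as 1, 2, 3 gives a skewness estimate of 0.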

7.13 Example (Correlation). Let $Z = (X, Y)$ and let $\rho = T(F) = \mathbb{E}(X - \mu_X)(Y - \mu_Y)/(\sigma_x \sigma_y)$
denote the correlation between $X$ and $Y$, where $F(x, y)$
is bivariate. We can write

$$T(F) = a(T_1(F), T_2(F), T_3(F), T_4(F), T_5(F))$$

where

$$T_1(F) = \int x\,dF(z), \quad T_2(F) = \int y\,dF(z), \quad T_3(F) = \int xy\,dF(z),$$
$$T_4(F) = \int x^2\,dF(z), \quad T_5(F) = \int y^2\,dF(z),$$

and

$$a(t_1, \ldots, t_5) = \frac{t_3 - t_1 t_2}{\sqrt{(t_4 - t_1^2)(t_5 - t_2^2)}}.$$

Replace $F$ with $\hat{F}_n$ in $T_1(F), \ldots, T_5(F)$, and take

$$\hat{\rho} = a(T_1(\hat{F}_n), T_2(\hat{F}_n), T_3(\hat{F}_n), T_4(\hat{F}_n), T_5(\hat{F}_n)).$$








7.14 Example (Quantiles). Let $F$ be strictly increasing with density $f$. For
$0 < p < 1$, the $p^{\text{th}}$ quantile is defined by $T(F) = F^{-1}(p)$. The estimate of
$T(F)$ is $\hat{F}_n^{-1}(p)$. We have to be a bit careful since $\hat{F}_n$ is not invertible. To
avoid ambiguity we define

$$\hat{F}_n^{-1}(p) = \inf\{x : \hat{F}_n(x) \ge p\}.$$

We call $T(\hat{F}_n) = \hat{F}_n^{-1}(p)$ the $p^{\text{th}}$ sample quantile. ■
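Because $\hat{F}_n$ jumps by $1/n$ at each order statistic, the infimum in Example 7.14 is attained at the smallest order statistic $X_{(k)}$ with $k/n \ge p$. The sketch below implements exactly that rule; the function name is illustrative, and other software often uses interpolating conventions that give slightly different answers.

```python
def sample_quantile(data, p):
    """Return inf{x : F_hat_n(x) >= p} for 0 < p < 1."""
    xs = sorted(data)
    n = len(xs)
    # Find the smallest k with k/n >= p; comparing k/n to p directly avoids
    # rounding surprises from computing ceil(n * p) in floating point.
    for k in range(1, n + 1):
        if k / n >= p:
            return xs[k - 1]
    return xs[-1]

values = [3, 1, 2, 4]
print(sample_quantile(values, 0.25), sample_quantile(values, 0.5), sample_quantile(values, 0.51))
```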



Only in the first example did we compute a standard error or a confidence 
interval. How shall we handle the other examples? When we discuss parametric 
methods, we will develop formulas for standard errors and confidence intervals. 
But in our nonparametric setting we need something else. In the next chapter, 
we will introduce the bootstrap for getting standard errors and confidence 
intervals. 



7.15 Example (Plasma Cholesterol). Figure 7.2 shows histograms for plasma
cholesterol (in mg/dl) for 371 patients with chest pain (Scott et al. (1978)).
The histograms show the percentage of patients in 10 bins. The first histogram
is for 51 patients who had no evidence of heart disease while the second
histogram is for 320 patients who had narrowing of the arteries. Is the mean
cholesterol different in the two groups? Let us regard these data as samples
from two distributions $F_1$ and $F_2$. Let $\mu_1 = \int x\,dF_1(x)$ and $\mu_2 = \int x\,dF_2(x)$
denote the means of the two populations. The plug-in estimates are
$\hat{\mu}_1 = \int x\,d\hat{F}_{n,1}(x) = \overline{X}_{n,1} = 195.27$ and $\hat{\mu}_2 = \int x\,d\hat{F}_{n,2}(x) = \overline{X}_{n,2} = 216.19$. Recall
that the standard error of the sample mean $\hat{\mu} = \frac{1}{n}\sum_{i=1}^n X_i$ is
$\mathsf{se}(\hat{\mu}) = \sigma/\sqrt{n}$, which we estimate by $\widehat{\mathsf{se}}(\hat{\mu}) = \hat{\sigma}/\sqrt{n}$.
For the two groups this yields $\widehat{\mathsf{se}}(\hat{\mu}_1) = 5.0$ and $\widehat{\mathsf{se}}(\hat{\mu}_2) = 2.4$. Approximate 95
percent confidence intervals for $\mu_1$ and $\mu_2$ are $\hat{\mu}_1 \pm 2\,\widehat{\mathsf{se}}(\hat{\mu}_1) = (185, 205)$ and
$\hat{\mu}_2 \pm 2\,\widehat{\mathsf{se}}(\hat{\mu}_2) = (211, 221)$.

Now, consider the functional $\theta = T(F_2) - T(F_1)$ whose plug-in estimate is
$\hat{\theta} = \hat{\mu}_2 - \hat{\mu}_1 = 216.19 - 195.27 = 20.92$. The standard error of $\hat{\theta}$ is

$$\mathsf{se} = \sqrt{\mathbb{V}(\hat{\mu}_2 - \hat{\mu}_1)} = \sqrt{\mathbb{V}(\hat{\mu}_2) + \mathbb{V}(\hat{\mu}_1)} = \sqrt{(\mathsf{se}(\hat{\mu}_1))^2 + (\mathsf{se}(\hat{\mu}_2))^2}$$

and we estimate this by

$$\widehat{\mathsf{se}} = \sqrt{(\widehat{\mathsf{se}}(\hat{\mu}_1))^2 + (\widehat{\mathsf{se}}(\hat{\mu}_2))^2} = 5.55.$$

An approximate 95 percent confidence interval for $\theta$ is $\hat{\theta} \pm 2\,\widehat{\mathsf{se}}(\hat{\theta}_n) = (9.8, 32.0)$.
This suggests that cholesterol is higher among those with narrowed arteries.
We should not jump to the conclusion (from these data) that cholesterol causes
heart disease. The leap from statistical evidence to causation is very subtle
and is discussed in Chapter 16. ■
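The arithmetic in Example 7.15 can be reproduced from the reported summary statistics alone. The sketch below uses only the numbers quoted in the example (the two sample means and the two estimated standard errors); the variable names are illustrative.

```python
import math

# Summary statistics reported in Example 7.15
mu1_hat, se1 = 195.27, 5.0   # patients with no evidence of heart disease
mu2_hat, se2 = 216.19, 2.4   # patients with narrowing of the arteries

theta_hat = mu2_hat - mu1_hat
# The two samples are independent, so the variances of the two means add
se_theta = math.sqrt(se1 ** 2 + se2 ** 2)
ci = (theta_hat - 2 * se_theta, theta_hat + 2 * se_theta)

print(round(theta_hat, 2), round(se_theta, 2), tuple(round(c, 1) for c in ci))
```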




FIGURE 7.2. Plasma cholesterol for 51 patients with no heart disease and 320
patients with narrowing of the arteries. (Two histogram panels: plasma cholesterol
for patients without heart disease, and plasma cholesterol for patients with
heart disease.)
7.3 Bibliographic Remarks

The Glivenko-Cantelli theorem is the tip of the iceberg. The theory of
distribution functions is a special case of what are called empirical processes
which underlie much of modern statistical theory. Some references on
empirical processes are Shorack and Wellner (1986) and van der Vaart and Wellner
(1996).

7.4 Exercises 

1. Prove Theorem 7.3.

2. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ and let $Y_1, \ldots, Y_m \sim \text{Bernoulli}(q)$. Find
the plug-in estimator and estimated standard error for $p$. Find an
approximate 90 percent confidence interval for $p$. Find the plug-in
estimator and estimated standard error for $p - q$. Find an approximate 90
percent confidence interval for $p - q$.

3. (Computer Experiment.) Generate 100 observations from a $N(0,1)$
distribution. Compute a 95 percent confidence band for the CDF $F$ (as
described in the appendix). Repeat this 1000 times and see how often
the confidence band contains the true distribution function. Repeat
using data from a Cauchy distribution.

4. Let $X_1, \ldots, X_n \sim F$ and let $\hat{F}_n(x)$ be the empirical distribution
function. For a fixed $x$, use the central limit theorem to find the limiting
distribution of $\hat{F}_n(x)$.

5. Let $x$ and $y$ be two distinct points. Find $\text{Cov}(\hat{F}_n(x), \hat{F}_n(y))$.

6. Let $X_1, \ldots, X_n \sim F$ and let $\hat{F}$ be the empirical distribution function.
Let $a < b$ be fixed numbers and define $\theta = T(F) = F(b) - F(a)$. Let
$\hat{\theta} = T(\hat{F}_n) = \hat{F}_n(b) - \hat{F}_n(a)$. Find the estimated standard error of $\hat{\theta}$.
Find an expression for an approximate $1 - \alpha$ confidence interval for $\theta$.

7. Data on the magnitudes of earthquakes near Fiji are available on the
website for this book. Estimate the CDF $F(x)$. Compute and plot a 95
percent confidence envelope for $F$ (as described in the appendix). Find
an approximate 95 percent confidence interval for $F(4.9) - F(4.3)$.

8. Get the data on eruption times and waiting times between eruptions of
the Old Faithful geyser from the website. Estimate the mean waiting
time and give a standard error for the estimate. Also, give a 90 percent
confidence interval for the mean waiting time. Now estimate the median
waiting time. In the next chapter we will see how to get the standard
error for the median.

9. 100 people are given a standard antibiotic to treat an infection and
another 100 are given a new antibiotic. In the first group, 90 people
recover; in the second group, 85 people recover. Let $p_1$ be the probability
of recovery under the standard treatment and let $p_2$ be the probability of
recovery under the new treatment. We are interested in estimating
$\theta = p_1 - p_2$. Provide an estimate, standard error, an 80 percent confidence
interval, and a 95 percent confidence interval for $\theta$.

10. In 1975, an experiment was conducted to see if cloud seeding produced
rainfall. 26 clouds were seeded with silver nitrate and 26 were not. The
decision to seed or not was made at random. Get the data from

http://lib.stat.cmu.edu/DASL/Stories/CloudSeeding.html

Let $\theta$ be the difference in the mean precipitation from the two groups.
Estimate $\theta$. Estimate the standard error of the estimate and produce a
95 percent confidence interval.




The Bootstrap 



The bootstrap is a method for estimating standard errors and computing
confidence intervals. Let $T_n = g(X_1, \ldots, X_n)$ be a statistic, that is, $T_n$ is any
function of the data. Suppose we want to know $\mathbb{V}_F(T_n)$, the variance of $T_n$.
We have written $\mathbb{V}_F$ to emphasize that the variance usually depends on the
unknown distribution function $F$. For example, if $T_n = \overline{X}_n$ then $\mathbb{V}_F(T_n) = \sigma^2/n$
where $\sigma^2 = \int (x - \mu)^2\,dF(x)$ and $\mu = \int x\,dF(x)$. Thus the variance of
$T_n$ is a function of $F$. The bootstrap idea has two steps:

Step 1: Estimate $\mathbb{V}_F(T_n)$ with $\mathbb{V}_{\hat{F}_n}(T_n)$.

Step 2: Approximate $\mathbb{V}_{\hat{F}_n}(T_n)$ using simulation.

For $T_n = \overline{X}_n$, we have for Step 1 that $\mathbb{V}_{\hat{F}_n}(T_n) = \hat{\sigma}^2/n$ where
$\hat{\sigma}^2 = n^{-1}\sum_{i=1}^n (X_i - \overline{X}_n)^2$. In this case, Step 1 is enough. However, in more complicated cases we
cannot write down a simple formula for $\mathbb{V}_{\hat{F}_n}(T_n)$ which is why we need Step
2. Before proceeding, let us discuss the idea of simulation.
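8.1 Simulation

Suppose we draw an IID sample $Y_1, \ldots, Y_B$ from a distribution $G$. By the law
of large numbers,

$$\overline{Y}_n = \frac{1}{B}\sum_{j=1}^B Y_j \xrightarrow{P} \int y\,dG(y) = \mathbb{E}(Y)$$

as $B \to \infty$. So if we draw a large sample from $G$, we can use the sample
mean $\overline{Y}_n$ to approximate $\mathbb{E}(Y)$. In a simulation, we can make $B$ as large as
we like, in which case, the difference between $\overline{Y}_n$ and $\mathbb{E}(Y)$ is negligible. More
generally, if $h$ is any function with finite mean then

$$\frac{1}{B}\sum_{j=1}^B h(Y_j) \xrightarrow{P} \int h(y)\,dG(y) = \mathbb{E}(h(Y))$$

as $B \to \infty$. In particular,

$$\frac{1}{B}\sum_{j=1}^B (Y_j - \overline{Y})^2 = \frac{1}{B}\sum_{j=1}^B Y_j^2 - \left(\frac{1}{B}\sum_{j=1}^B Y_j\right)^2 \xrightarrow{P} \int y^2\,dG(y) - \left(\int y\,dG(y)\right)^2 = \mathbb{V}(Y).$$

Hence, we can use the sample variance of the simulated values to approximate
$\mathbb{V}(Y)$.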
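As a quick illustration of the simulation idea, the sketch below approximates $\mathbb{E}(Y)$ and $\mathbb{V}(Y)$ by drawing a large sample. The choice of $G$ as the Exponential(1) distribution (mean 1, variance 1), the seed, and the function name are assumptions made for this example only.

```python
import random

def monte_carlo_mean_and_variance(draw, B, seed=0):
    """Approximate E(Y) and V(Y) from B simulated draws Y_1, ..., Y_B."""
    rng = random.Random(seed)
    ys = [draw(rng) for _ in range(B)]
    mean = sum(ys) / B
    # Divide by B, matching the formula in the text
    variance = sum((y - mean) ** 2 for y in ys) / B
    return mean, variance

# Assumed example distribution G: Exponential with rate 1
mean, variance = monte_carlo_mean_and_variance(lambda rng: rng.expovariate(1.0), B=200_000)
print(mean, variance)  # both should be close to 1
```

Increasing `B` shrinks the simulation error at the usual $1/\sqrt{B}$ rate, which is why this error is treated as negligible in what follows.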



8.2 Bootstrap Variance Estimation 

According to what we just learned, we can approximate $\mathbb{V}_{\hat{F}_n}(T_n)$ by
simulation. Now $\mathbb{V}_{\hat{F}_n}(T_n)$ means “the variance of $T_n$ if the distribution of the data
is $\hat{F}_n$.” How can we simulate from the distribution of $T_n$ when the data are
assumed to have distribution $\hat{F}_n$? The answer is to simulate $X_1^*, \ldots, X_n^*$ from
$\hat{F}_n$ and then compute $T_n^* = g(X_1^*, \ldots, X_n^*)$. This constitutes one draw from
the distribution of $T_n$. The idea is illustrated in the following diagram:

Real world: $F \implies X_1, \ldots, X_n \implies T_n = g(X_1, \ldots, X_n)$

Bootstrap world: $\hat{F}_n \implies X_1^*, \ldots, X_n^* \implies T_n^* = g(X_1^*, \ldots, X_n^*)$

How do we simulate $X_1^*, \ldots, X_n^*$ from $\hat{F}_n$? Notice that $\hat{F}_n$ puts mass $1/n$ at
each data point $X_1, \ldots, X_n$. Therefore,

drawing an observation from $\hat{F}_n$ is equivalent to drawing
one point at random from the original data set.

Thus, to simulate $X_1^*, \ldots, X_n^* \sim \hat{F}_n$ it suffices to draw $n$ observations with
replacement from $X_1, \ldots, X_n$. Here is a summary:

Bootstrap Variance Estimation

1. Draw $X_1^*, \ldots, X_n^* \sim \hat{F}_n$.

2. Compute $T_n^* = g(X_1^*, \ldots, X_n^*)$.

3. Repeat steps 1 and 2, $B$ times, to get $T_{n,1}^*, \ldots, T_{n,B}^*$.

4. Let

$$v_{boot} = \frac{1}{B}\sum_{b=1}^B \left(T_{n,b}^* - \frac{1}{B}\sum_{r=1}^B T_{n,r}^*\right)^2. \tag{8.1}$$



8.1 Example. The following pseudocode shows how to use the bootstrap to
estimate the standard error of the median.

Bootstrap for The Median

Given data X = (X(1), ..., X(n)):

T <- median(X)
Tboot <- vector of length B
for(i in 1:B){
    Xstar <- sample of size n from X (with replacement)
    Tboot[i] <- median(Xstar)
}
se <- sqrt(variance(Tboot))
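The four steps of bootstrap variance estimation, and the pseudocode for the median, can be sketched in runnable Python as follows. This is one possible rendering under stated assumptions: the function name, the seeds, and the simulated Normal data are illustrative and not from the text.

```python
import random
import statistics

def bootstrap_variance(data, statistic, B=1000, seed=0):
    """Estimate the variance of statistic(data) via equation (8.1)."""
    rng = random.Random(seed)
    n = len(data)
    t_boot = []
    for _ in range(B):
        x_star = rng.choices(data, k=n)    # n draws with replacement from the data
        t_boot.append(statistic(x_star))
    return statistics.pvariance(t_boot)    # divides by B, as in (8.1)

# Illustrative data: 100 draws from N(0, 1)
data_rng = random.Random(1)
data = [data_rng.gauss(0, 1) for _ in range(100)]

v_boot = bootstrap_variance(data, statistics.median)
se_boot = v_boot ** 0.5
print(se_boot)
```

Passing the statistic as a function means the same routine works for the median, the skewness, or any other function of the data. As a sanity check, applying it to the sample mean should give a value close to $\hat{\sigma}^2/n$.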

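The following schematic diagram will remind you that we are using two
approximations:

$$\mathbb{V}_F(T_n) \ \underbrace{\approx}_{\text{not so small}} \ \mathbb{V}_{\hat{F}_n}(T_n) \ \underbrace{\approx}_{\text{small}} \ v_{boot}.$$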



8.2 Example. Consider the nerve data. Let $\theta = T(F) = \int (x - \mu)^3\,dF(x)/\sigma^3$ be
the skewness. The skewness is a measure of asymmetry. A Normal distribution,
for example, has skewness 0. The plug-in estimate of the skewness is

$$\hat{\theta} = T(\hat{F}_n) = \frac{\int (x - \hat{\mu})^3\,d\hat{F}_n(x)}{\hat{\sigma}^3} = \frac{\frac{1}{n}\sum_{i=1}^n (X_i - \overline{X}_n)^3}{\hat{\sigma}^3} = 1.76.$$

To estimate the standard error with the bootstrap we follow the same steps
as with the median example except we compute the skewness from each
bootstrap sample. When applied to the nerve data, the bootstrap, based on
$B = 1{,}000$ replications, yields a standard error for the estimated skewness of
.16. ■



8.3 Bootstrap Confidence Intervals 

There are several ways to construct bootstrap confidence intervals. Here we 
discuss three methods. 

Method 1: The Normal Interval. The simplest method is the Normal interval

$$T_n \pm z_{\alpha/2}\,\widehat{\mathsf{se}}_{boot}$$

where $\widehat{\mathsf{se}}_{boot} = \sqrt{v_{boot}}$ is the bootstrap estimate of the standard error. This
interval is not accurate unless the distribution of $T_n$ is close to Normal.

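Method 2: Pivotal Intervals. Let $\theta = T(F)$ and $\hat{\theta}_n = T(\hat{F}_n)$ and define the
pivot $R_n = \hat{\theta}_n - \theta$. Let $\hat{\theta}_{n,1}^*, \ldots, \hat{\theta}_{n,B}^*$ denote bootstrap replications of $\hat{\theta}_n$. Let
$H(r)$ denote the CDF of the pivot:

$$H(r) = \mathbb{P}_F(R_n \le r). \tag{8.3}$$

Define $C_n^\star = (a, b)$ where

$$a = \hat{\theta}_n - H^{-1}\left(1 - \frac{\alpha}{2}\right) \quad \text{and} \quad b = \hat{\theta}_n - H^{-1}\left(\frac{\alpha}{2}\right). \tag{8.4}$$

It follows that

$$\begin{aligned}
\mathbb{P}(a \le \theta \le b) &= \mathbb{P}(a - \hat{\theta}_n \le \theta - \hat{\theta}_n \le b - \hat{\theta}_n) \\
&= \mathbb{P}(\hat{\theta}_n - b \le \hat{\theta}_n - \theta \le \hat{\theta}_n - a) \\
&= \mathbb{P}(\hat{\theta}_n - b \le R_n \le \hat{\theta}_n - a) \\
&= H(\hat{\theta}_n - a) - H(\hat{\theta}_n - b) \\
&= H\left(H^{-1}\left(1 - \frac{\alpha}{2}\right)\right) - H\left(H^{-1}\left(\frac{\alpha}{2}\right)\right) \\
&= 1 - \frac{\alpha}{2} - \frac{\alpha}{2} = 1 - \alpha.
\end{aligned}$$

Hence, $C_n^\star$ is an exact $1 - \alpha$ confidence interval for $\theta$. Unfortunately, $a$ and $b$
depend on the unknown distribution $H$ but we can form a bootstrap estimate
of $H$:

$$\hat{H}(r) = \frac{1}{B}\sum_{b=1}^B I(R_{n,b}^* \le r) \tag{8.5}$$

where $R_{n,b}^* = \hat{\theta}_{n,b}^* - \hat{\theta}_n$. Let $r_\beta^*$ denote the $\beta$ sample quantile of $(R_{n,1}^*, \ldots, R_{n,B}^*)$
and let $\theta_\beta^*$ denote the $\beta$ sample quantile of $(\hat{\theta}_{n,1}^*, \ldots, \hat{\theta}_{n,B}^*)$. Note that
$r_\beta^* = \theta_\beta^* - \hat{\theta}_n$. It follows that an approximate $1 - \alpha$ confidence interval is $C_n = (\hat{a}, \hat{b})$
where

$$\hat{a} = \hat{\theta}_n - \hat{H}^{-1}\left(1 - \frac{\alpha}{2}\right) = \hat{\theta}_n - r_{1-\alpha/2}^* = 2\hat{\theta}_n - \theta_{1-\alpha/2}^*$$
$$\hat{b} = \hat{\theta}_n - \hat{H}^{-1}\left(\frac{\alpha}{2}\right) = \hat{\theta}_n - r_{\alpha/2}^* = 2\hat{\theta}_n - \theta_{\alpha/2}^*.$$

In summary, the $1 - \alpha$ bootstrap pivotal confidence interval is

$$C_n = \left(2\hat{\theta}_n - \theta_{1-\alpha/2}^*,\ 2\hat{\theta}_n - \theta_{\alpha/2}^*\right). \tag{8.6}$$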

8.3 Theorem. Under weak conditions on $T(F)$,

$$\mathbb{P}_F(T(F) \in C_n) \to 1 - \alpha$$

as $n \to \infty$, where $C_n$ is given in (8.6).

Method 3: Percentile Intervals. The bootstrap percentile interval is
defined by

$$C_n = \left(\theta_{\alpha/2}^*,\ \theta_{1-\alpha/2}^*\right).$$

The justification for this interval is given in the appendix.
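The three methods differ only in how they use the bootstrap replications, which the following sketch makes explicit. It is a minimal illustration under stated assumptions: the function names, seeds, and simulated data are hypothetical, and the quantile convention is the inf-based sample quantile rather than an interpolating one.

```python
import random
import statistics
from statistics import NormalDist

def sample_quantile(values, p):
    """Smallest order statistic x_(k) with k/n >= p."""
    xs = sorted(values)
    n = len(xs)
    for k in range(1, n + 1):
        if k / n >= p:
            return xs[k - 1]
    return xs[-1]

def bootstrap_intervals(data, statistic, alpha=0.05, B=1000, seed=0):
    """Return the Normal, pivotal, and percentile bootstrap intervals."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = statistic(data)
    reps = [statistic(rng.choices(data, k=n)) for _ in range(B)]
    se_boot = statistics.pstdev(reps)              # sqrt of v_boot
    z = NormalDist().inv_cdf(1 - alpha / 2)
    q_lo = sample_quantile(reps, alpha / 2)        # theta*_{alpha/2}
    q_hi = sample_quantile(reps, 1 - alpha / 2)    # theta*_{1 - alpha/2}
    return {
        "normal": (theta_hat - z * se_boot, theta_hat + z * se_boot),
        "pivotal": (2 * theta_hat - q_hi, 2 * theta_hat - q_lo),
        "percentile": (q_lo, q_hi),
    }

# Illustrative data: 60 draws from N(0, 1); the statistic is the median
data_rng = random.Random(2)
data = [data_rng.gauss(0, 1) for _ in range(60)]
ci = bootstrap_intervals(data, statistics.median)
for name, interval in ci.items():
    print(name, interval)
```

The pivotal interval is the percentile interval reflected about $\hat{\theta}_n$, so the two coincide only when the bootstrap distribution is symmetric around the point estimate.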

8.4 Example. For estimating the skewness of the nerve data, here are the
various confidence intervals.

Method        95% Interval
Normal        (1.44, 2.09)
Pivotal       (1.48, 2.11)
Percentile    (1.42, 2.03)

All these confidence intervals are approximate. The probability that $T(F)$
is in the interval is not exactly $1 - \alpha$. All three intervals have the same level
of accuracy. There are more accurate bootstrap confidence intervals but they
are more complicated and we will not discuss them here.
8.5 Example (The Plasma Cholesterol Data). Let us return to the cholesterol
data. Suppose we are interested in the difference of the medians. Pseudocode
for the bootstrap analysis is as follows:

x1 <- first sample
x2 <- second sample
n1 <- length(x1)
n2 <- length(x2)
th.hat <- median(x2) - median(x1)
B <- 1000
Tboot <- vector of length B
for(i in 1:B){
    xx1 <- sample of size n1 with replacement from x1
    xx2 <- sample of size n2 with replacement from x2
    Tboot[i] <- median(xx2) - median(xx1)
}
se <- sqrt(variance(Tboot))
Normal <- (th.hat - 2*se, th.hat + 2*se)
percentile <- (quantile(Tboot, .025), quantile(Tboot, .975))
pivotal <- (2*th.hat - quantile(Tboot, .975),
            2*th.hat - quantile(Tboot, .025))

The point estimate is 18.5, the bootstrap standard error is 7.42 and the
resulting approximate 95 percent confidence intervals are as follows:

Method        95% Interval
Normal        (3.7, 33.3)
Pivotal       (5.0, 34.0)
Percentile    (5.0, 33.3)

Since these intervals exclude 0, it appears that the second group has higher
cholesterol although there is considerable uncertainty about how much higher
as reflected in the width of the intervals. ■

The next two examples are based on small sample sizes. In practice,
statistical methods based on very small sample sizes might not be reliable. We
include the examples for their pedagogical value, but we caution that the
results should be interpreted with some skepticism.

8.6 Example. Here is an example that was one of the first used to illustrate 
the bootstrap by Bradley Efron, the inventor of the bootstrap. The data are 
LSAT scores (for entrance to law school) and GPA. 
LSAT   576    635    558    578    666    580    555    661
GPA    3.39   3.30   2.81   3.03   3.44   3.07   3.00   3.43

LSAT   651    605    653    575    545    572    594
GPA    3.36   3.13   3.12   2.74   2.76   2.88   3.96

Each data point is of the form $X_i = (Y_i, Z_i)$ where $Y_i = \text{LSAT}_i$ and
$Z_i = \text{GPA}_i$. The law school is interested in the correlation

$$\theta = \frac{\int\!\!\int (y - \mu_Y)(z - \mu_Z)\,dF(y, z)}{\sqrt{\int (y - \mu_Y)^2\,dF(y)\,\int (z - \mu_Z)^2\,dF(z)}}.$$

The plug-in estimate is the sample correlation

$$\hat{\theta} = \frac{\sum_i (Y_i - \overline{Y})(Z_i - \overline{Z})}{\sqrt{\sum_i (Y_i - \overline{Y})^2 \sum_i (Z_i - \overline{Z})^2}}.$$

The estimated correlation is $\hat{\theta} = .776$. The bootstrap based on $B = 1000$
gives $\widehat{\mathsf{se}} = .137$. Figure 8.1 shows the data and a histogram of the bootstrap
replications $\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*$. This histogram is an approximation to the sampling
distribution of $\hat{\theta}$. The Normal-based 95 percent confidence interval is
$.78 \pm 2\,\widehat{\mathsf{se}} = (.51, 1.00)$ while the percentile interval is (.46, .96). In large samples, the
two methods will show closer agreement. ■
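For bivariate data, a bootstrap sample is drawn by resampling whole pairs $(Y_i, Z_i)$, so that the dependence between the two coordinates is preserved. The sketch below illustrates this with the values as transcribed in the table above; the function names and the seed are illustrative assumptions, and because the result depends on the simulation and on the transcription of the data, no numerical output is claimed here.

```python
import math
import random
import statistics

lsat = [576, 635, 558, 578, 666, 580, 555, 661, 651, 605, 653, 575, 545, 572, 594]
gpa = [3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43, 3.36, 3.13, 3.12, 2.74, 2.76, 2.88, 3.96]

def correlation(pairs):
    """Plug-in (sample) correlation of a list of (y, z) pairs."""
    ys = [y for y, _ in pairs]
    zs = [z for _, z in pairs]
    y_bar, z_bar = statistics.fmean(ys), statistics.fmean(zs)
    num = sum((y - y_bar) * (z - z_bar) for y, z in pairs)
    den = math.sqrt(sum((y - y_bar) ** 2 for y in ys) * sum((z - z_bar) ** 2 for z in zs))
    return num / den  # den is zero only if a resample repeats a single pair n times

def bootstrap_se_correlation(pairs, B=1000, seed=0):
    """Bootstrap standard error of the correlation, resampling pairs jointly."""
    rng = random.Random(seed)
    n = len(pairs)
    reps = [correlation(rng.choices(pairs, k=n)) for _ in range(B)]
    return statistics.pstdev(reps)

pairs = list(zip(lsat, gpa))
theta_hat = correlation(pairs)
se_hat = bootstrap_se_correlation(pairs)
print(theta_hat, se_hat)
```

Resampling the LSAT and GPA columns independently would destroy the pairing and give replications centered near zero correlation, which is not what the bootstrap world in this example calls for.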



8.7 Example. This example is from Efron and Tibshirani (1993). When drug 
companies introduce new medications, they are sometimes required to show 
bioequivalence. This means that the new drug is not substantially different 
than the current treatment. Here are data on eight subjects who used medi- 
cal patches to infuse a hormone into the blood. Each subject received three 
treatments: placebo, old-patch, new-patch. 



subject   placebo   old     new     old - placebo   new - old
1         9243      17649   16449   8406            -1200
2         9671      12013   14614   2342            2601
3         11792     19979   17274   8187            -2705
4         13357     21816   23798   8459            1982
5         9055      13850   12560   4795            -1290
6         6290      9806    10157   3516            351
7         12412     17208   16570   4796            -638
8         18806     29044   26325   10238           -2719
FIGURE 8.1. Law school data. The top panel shows the raw data. The bottom panel
is a histogram of the correlations computed from each bootstrap sample.



Let $Z = \text{old} - \text{placebo}$ and $Y = \text{new} - \text{old}$. The Food and Drug
Administration (FDA) requirement for bioequivalence is that $|\theta| \le .20$ where

$$\theta = \frac{\mathbb{E}_F(Y)}{\mathbb{E}_F(Z)}.$$

The plug-in estimate of $\theta$ is

$$\hat{\theta} = \frac{\overline{Y}}{\overline{Z}} = \frac{-452.3}{6342} = -0.0713.$$

The bootstrap standard error is $\widehat{\mathsf{se}} = 0.105$. To answer the bioequivalence
question, we compute a confidence interval. From $B = 1000$ bootstrap
replications we get the 95 percent interval (-0.24, 0.15). This is not quite contained
in (-0.20, 0.20) so at the 95 percent level we have not demonstrated
bioequivalence. Figure 8.2 shows the histogram of the bootstrap values. ■
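The plug-in estimate in Example 8.7 can be reproduced from the last two columns of the table, and the bootstrap standard error can be approximated by resampling subjects. In the sketch below, the function names and the seed are illustrative assumptions; the bootstrap value depends on the simulation, so it will only be near, not equal to, the figure reported in the text.

```python
import random
import statistics

# Columns "old - placebo" (Z) and "new - old" (Y) from the table in Example 8.7
z_vals = [8406, 2342, 8187, 8459, 4795, 3516, 4796, 10238]
y_vals = [-1200, 2601, -2705, 1982, -1290, 351, -638, -2719]

def ratio_of_means(pairs):
    """Plug-in estimate of theta = E(Y) / E(Z) from a list of (z, y) pairs."""
    z_bar = statistics.fmean([z for z, _ in pairs])
    y_bar = statistics.fmean([y for _, y in pairs])
    return y_bar / z_bar

def bootstrap_se(pairs, B=1000, seed=0):
    """Bootstrap standard error, resampling subjects (pairs) with replacement."""
    rng = random.Random(seed)
    n = len(pairs)
    reps = [ratio_of_means(rng.choices(pairs, k=n)) for _ in range(B)]
    return statistics.pstdev(reps)

pairs = list(zip(z_vals, y_vals))
theta_hat = ratio_of_means(pairs)
se_hat = bootstrap_se(pairs)
print(round(theta_hat, 4), se_hat)
```

Every value of $Z$ in the table is positive, so the denominator $\overline{Z}^*$ is positive in every bootstrap sample and the ratio is always well defined.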




FIGURE 8.2. Patch data. (Histogram; horizontal axis: bootstrap samples.)



8.4 Bibliographic Remarks 

The bootstrap was invented by Efron (1979). There are several books on these 
topics including Efron and Tibshirani (1993), Davison and Hinkley (1997), 
Hall (1992) and Shao and Tu (1995). Also, see section 3.6 of van der Vaart 
and Wellner (1996). 



8.5 Appendix 

8.5.1 The Jackknife

There is another method for computing standard errors called the jackknife,
due to Quenouille (1949). It is less computationally expensive than the
bootstrap but is less general. Let $T_n = T(X_1, \ldots, X_n)$ be a statistic and let $T_{(-i)}$
denote the statistic with the $i^{\text{th}}$ observation removed. Let $\overline{T}_n = n^{-1}\sum_{i=1}^n T_{(-i)}$.
The jackknife estimate of $\text{var}(T_n)$ is

$$v_{jack} = \frac{n-1}{n}\sum_{i=1}^n (T_{(-i)} - \overline{T}_n)^2$$

and the jackknife estimate of the standard error is $\widehat{\mathsf{se}}_{jack} = \sqrt{v_{jack}}$. Under
suitable conditions on $T$, it can be shown that $v_{jack}$ consistently estimates
$\text{var}(T_n)$ in the sense that $v_{jack}/\text{var}(T_n) \xrightarrow{P} 1$. However, unlike the bootstrap,
the jackknife does not produce consistent estimates of the standard error of
sample quantiles.
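The jackknife formula is short enough to sketch directly. The function name and toy data below are illustrative assumptions. For the sample mean, the jackknife variance works out to exactly $S_n^2/n$, which gives a convenient check.

```python
import statistics

def jackknife_variance(data, statistic):
    """Jackknife estimate of var(T_n) for T_n = statistic(data), data a list."""
    n = len(data)
    # T_(-i): the statistic recomputed with the i-th observation removed
    leave_one_out = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    t_bar = sum(leave_one_out) / n
    return (n - 1) / n * sum((t - t_bar) ** 2 for t in leave_one_out)

data = [1.0, 2.0, 4.0, 7.0]
v_jack = jackknife_variance(data, statistics.fmean)
print(v_jack, statistics.variance(data) / len(data))
```

Only $n$ recomputations are needed, compared with $B$ resamples for the bootstrap, which is the sense in which the jackknife is cheaper.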



8.5.2 Justification For The Percentile Interval 

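Suppose there exists a monotone transformation $U = m(T)$ such that
$U \sim N(\phi, c^2)$ where $\phi = m(\theta)$. We do not suppose we know the transformation,
only that one exists. Let $U_b^* = m(\theta_{n,b}^*)$. Let $u_\beta^*$ be the $\beta$ sample quantile of
the $U_b^*$'s. Since a monotone transformation preserves quantiles, we have that
$u_{\alpha/2}^* = m(\theta_{\alpha/2}^*)$. Also, since $U \sim N(\phi, c^2)$, the $\alpha/2$ quantile of $U$ is $\phi - z_{\alpha/2}c$.
In the bootstrap world the replications $U_b^*$ are centered at the observed $U$ rather
than at $\phi$, so $u_{\alpha/2}^* \approx U - z_{\alpha/2}c$. Similarly, $u_{1-\alpha/2}^* \approx U + z_{\alpha/2}c$. Therefore,

$$\begin{aligned}
\mathbb{P}(\theta_{\alpha/2}^* \le \theta \le \theta_{1-\alpha/2}^*) &= \mathbb{P}(m(\theta_{\alpha/2}^*) \le m(\theta) \le m(\theta_{1-\alpha/2}^*)) \\
&= \mathbb{P}(u_{\alpha/2}^* \le \phi \le u_{1-\alpha/2}^*) \\
&\approx \mathbb{P}(U - c z_{\alpha/2} \le \phi \le U + c z_{\alpha/2}) \\
&= \mathbb{P}\left(-z_{\alpha/2} \le \frac{U - \phi}{c} \le z_{\alpha/2}\right) \\
&= 1 - \alpha.
\end{aligned}$$

An exact normalizing transformation will rarely exist but there may exist
approximate normalizing transformations.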



8.6 Exercises 

1. Consider the data in Example 8.6. Find the plug-in estimate of the
correlation coefficient. Estimate the standard error using the bootstrap.
Find a 95 percent confidence interval using the Normal, pivotal, and
percentile methods.

2. (Computer Experiment.) Conduct a simulation to compare the various
bootstrap confidence interval methods. Let $n = 50$ and let
$T(F) = \int (x - \mu)^3\,dF(x)/\sigma^3$ be the skewness. Draw $Y_1, \ldots, Y_n \sim N(0, 1)$ and
set $X_i = e^{Y_i}$, $i = 1, \ldots, n$. Construct the three types of bootstrap 95
percent intervals for $T(F)$ from the data $X_1, \ldots, X_n$. Repeat this whole
thing many times and estimate the true coverage of the three intervals.

3. Let

$$X_1, \ldots, X_n \sim t_3$$

where $n = 25$. Let $\theta = T(F) = (q_{.75} - q_{.25})/1.34$ where $q_p$ denotes the $p^{\text{th}}$
quantile. Do a simulation to compare the coverage and length of the
following confidence intervals for $\theta$: (i) Normal interval with standard
error from the bootstrap, (ii) bootstrap percentile interval, and (iii)
pivotal bootstrap interval.

4. Let $X_1, \ldots, X_n$ be distinct observations (no ties). Show that there are

$$\binom{2n-1}{n}$$

distinct bootstrap samples.

Hint: Imagine putting $n$ balls into $n$ buckets.

5. Let $X_1, \ldots, X_n$ be distinct observations (no ties). Let $X_1^*, \ldots, X_n^*$ denote
a bootstrap sample and let $\overline{X}_n^* = n^{-1}\sum_{i=1}^n X_i^*$. Find: $\mathbb{E}(\overline{X}_n^* \mid X_1, \ldots, X_n)$,
$\mathbb{V}(\overline{X}_n^* \mid X_1, \ldots, X_n)$, $\mathbb{E}(\overline{X}_n^*)$ and $\mathbb{V}(\overline{X}_n^*)$.

6. (Computer Experiment.) Let $X_1, \ldots, X_n \sim \text{Normal}(\mu, 1)$. Let $\theta = e^\mu$ and let
$\hat{\theta} = e^{\overline{X}}$. Create a data set (using $\mu = 5$) consisting of $n = 100$
observations.

(a) Use the bootstrap to get the se and 95 percent confidence interval
for $\theta$.

(b) Plot a histogram of the bootstrap replications. This is an estimate
of the distribution of $\hat{\theta}$. Compare this to the true sampling distribution
of $\hat{\theta}$.

7. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$. Let $\hat{\theta} = X_{\max} = \max\{X_1, \ldots, X_n\}$.
Generate a data set of size 50 with $\theta = 1$.

(a) Find the distribution of $\hat{\theta}$. Compare the true distribution of $\hat{\theta}$ to the
histograms from the bootstrap.

(b) This is a case where the bootstrap does very poorly. In fact, we can
prove that this is the case. Show that $\mathbb{P}(\hat{\theta} = \theta) = 0$ and yet
$\mathbb{P}(\hat{\theta}^* = \hat{\theta}) \approx .632$. Hint: show that $\mathbb{P}(\hat{\theta}^* = \hat{\theta}) = 1 - (1 - (1/n))^n$, then take the
limit as $n$ gets large.

8. Let $T_n = \overline{X}_n^2$, $\mu = \mathbb{E}(X_1)$, $\alpha_k = \int |x - \mu|^k\,dF(x)$ and $\hat{\alpha}_k = n^{-1}\sum_{i=1}^n |X_i - \overline{X}_n|^k$.
Show that

$$v_{boot} = \frac{4\overline{X}_n^2\,\hat{\alpha}_2}{n} + \frac{4\overline{X}_n\,\hat{\alpha}_3}{n^2} + \frac{\hat{\alpha}_4}{n^3}.$$
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Parametric Inference 



We now turn our attention to parametric models, that is, models of the form

$$\mathfrak{F} = \{ f(x;\theta) : \theta \in \Theta \} \qquad (9.1)$$

where $\Theta \subset \mathbb{R}^k$ is the parameter space and
$\theta = (\theta_1, \ldots, \theta_k)$ is the parameter. The problem of
inference then reduces to the problem of estimating the parameter $\theta$.

Students learning statistics often ask: how would we ever know that the 
distribution that generated the data is in some parametric model? This is 
an excellent question. Indeed, we would rarely have such knowledge which 
is why nonparametric methods are preferable. Still, studying methods for 
parametric models is useful for two reasons. First, there are some cases where 
background knowledge suggests that a parametric model provides a reasonable 
approximation. For example, counts of traffic accidents are known from prior 
experience to follow approximately a Poisson model. Second, the inferential 
concepts for parametric models provide background for understanding certain 
nonparametric methods. 

We begin with a brief discussion about parameters of interest and nuisance
parameters in the next section, then we will discuss two methods for
estimating $\theta$: the method of moments and the method of maximum likelihood.

9.1 Parameter of Interest



Often, we are only interested in some function $T(\theta)$. For example, if
$X \sim N(\mu, \sigma^2)$ then the parameter is $\theta = (\mu, \sigma)$. If our
goal is to estimate $\mu$ then $\mu = T(\theta)$ is called the parameter of
interest and $\sigma$ is called a nuisance parameter. The parameter of
interest might be a complicated function of $\theta$ as in the following example.



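9.1 Example. Let $X_1, \ldots, X_n \sim \text{Normal}(\mu, \sigma^2)$. The
parameter is $\theta = (\mu, \sigma)$ and the parameter space is
$\Theta = \{(\mu, \sigma) : \mu \in \mathbb{R}, \sigma > 0\}$. Suppose that $X_i$ is
the outcome of a blood test and suppose we are interested in $\tau$, the fraction
of the population whose test score is larger than 1. Let $Z$ denote a standard
Normal random variable. Then

$$\tau = P(X > 1) = 1 - P(X < 1) = 1 - P\left( \frac{X - \mu}{\sigma} < \frac{1 - \mu}{\sigma} \right)
= 1 - P\left( Z < \frac{1 - \mu}{\sigma} \right) = 1 - \Phi\left( \frac{1 - \mu}{\sigma} \right).$$

The parameter of interest is $\tau = T(\mu, \sigma) = 1 - \Phi((1 - \mu)/\sigma)$. ■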
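As a quick numerical illustration of Example 9.1, the sketch below evaluates $\tau = 1 - \Phi((1-\mu)/\sigma)$ for given values of $\mu$ and $\sigma$. The function names are ours, and $\Phi$ is computed from the error function in the Python standard library; this is one convenient way to do it, not part of the text.

```python
import math

def std_normal_cdf(z):
    """Phi(z), the standard Normal CDF, written in terms of erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def fraction_above(mu, sigma, threshold=1.0):
    """tau = P(X > threshold) when X ~ Normal(mu, sigma^2)."""
    return 1.0 - std_normal_cdf((threshold - mu) / sigma)

# The nuisance parameter sigma still matters for tau unless mu equals the threshold.
for mu, sigma in [(0.0, 1.0), (0.0, 2.0), (1.0, 2.0)]:
    print(mu, sigma, round(fraction_above(mu, sigma), 4))
```

In practice one would plug estimates of $\mu$ and $\sigma$ into this function to estimate $\tau$.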

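9.2 Example. Recall that $X$ has a Gamma($\alpha, \beta$) distribution if

$$f(x; \alpha, \beta) = \frac{1}{\beta^{\alpha} \Gamma(\alpha)} x^{\alpha - 1} e^{-x/\beta}, \quad x > 0$$

where $\alpha, \beta > 0$ and

$$\Gamma(\alpha) = \int_0^{\infty} y^{\alpha - 1} e^{-y} \, dy$$

is the Gamma function. The parameter is $\theta = (\alpha, \beta)$. The Gamma
distribution is sometimes used to model lifetimes of people, animals, and
electronic equipment. Suppose we want to estimate the mean lifetime. Then
$T(\alpha, \beta) = E_\theta(X_1) = \alpha\beta$. ■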



9.2 The Method of Moments 

The first method for generating parametric estimators that we will study
is called the method of moments. We will see that these estimators are not
optimal but they are often easy to compute. They are also useful as starting
values for other methods that require iterative numerical routines.




Suppose that the parameter $\theta = (\theta_1, \ldots, \theta_k)$ has $k$
components. For $1 \le j \le k$, define the $j^{\rm th}$ moment

$$\alpha_j \equiv \alpha_j(\theta) = E_\theta(X^j) = \int x^j \, dF_\theta(x) \qquad (9.2)$$

and the $j^{\rm th}$ sample moment

$$\hat\alpha_j = \frac{1}{n} \sum_{i=1}^n X_i^j. \qquad (9.3)$$


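9.3 Definition. The method of moments estimator $\hat\theta_n$ is defined to be
the value of $\theta$ such that

$$\begin{aligned}
\alpha_1(\hat\theta_n) &= \hat\alpha_1 \\
\alpha_2(\hat\theta_n) &= \hat\alpha_2 \\
&\;\;\vdots \\
\alpha_k(\hat\theta_n) &= \hat\alpha_k.
\end{aligned} \qquad (9.4)$$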



Formula (9.4) defines a system of k equations with k unknowns. 

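9.4 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. Then
$\alpha_1 = E_p(X) = p$ and $\hat\alpha_1 = n^{-1} \sum_{i=1}^n X_i$. By equating
these we get the estimator

$$\hat p_n = \frac{1}{n} \sum_{i=1}^n X_i. \quad ■$$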

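9.5 Example. Let $X_1, \ldots, X_n \sim \text{Normal}(\mu, \sigma^2)$. Then,
$\alpha_1 = E_\theta(X_1) = \mu$ and
$\alpha_2 = E_\theta(X_1^2) = V_\theta(X_1) + (E_\theta(X_1))^2 = \sigma^2 + \mu^2$.
We need to solve the equations¹

$$\hat\mu = \frac{1}{n} \sum_{i=1}^n X_i, \qquad \hat\sigma^2 + \hat\mu^2 = \frac{1}{n} \sum_{i=1}^n X_i^2.$$

This is a system of 2 equations with 2 unknowns. The solution is

$$\hat\mu = \overline{X}_n, \qquad \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \overline{X}_n)^2. \quad ■$$

¹Recall that $V(X) = E(X^2) - (E(X))^2$. Hence, $E(X^2) = V(X) + (E(X))^2$.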
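The two moment equations for the Normal model can be solved in a few lines. The following is a minimal sketch (the function name and the toy data are ours) that computes the first two sample moments and returns $\hat\mu$ and $\hat\sigma^2$.

```python
def normal_method_of_moments(xs):
    """Solve alpha_1 = mu and alpha_2 = sigma^2 + mu^2 for (mu, sigma^2)."""
    n = len(xs)
    a1 = sum(xs) / n                    # first sample moment
    a2 = sum(x * x for x in xs) / n     # second sample moment
    mu_hat = a1
    sigma2_hat = a2 - a1 * a1           # equals (1/n) * sum (x - mean)^2
    return mu_hat, sigma2_hat

mu_hat, sigma2_hat = normal_method_of_moments([1.0, 2.0, 3.0, 4.0])
print(mu_hat, sigma2_hat)
```

Subtracting $\hat\alpha_1^2$ from $\hat\alpha_2$ can lose precision when the mean is large relative to the spread; the centered form $n^{-1}\sum_i (X_i - \overline{X}_n)^2$ is algebraically the same and numerically safer.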

9.6 Theorem. Let $\hat\theta_n$ denote the method of moments estimator. Under
appropriate conditions on the model, the following statements hold:

1. The estimate $\hat\theta_n$ exists with probability tending to 1.

2. The estimate is consistent: $\hat\theta_n \xrightarrow{P} \theta$.

3. The estimate is asymptotically Normal:

$$\sqrt{n}(\hat\theta_n - \theta) \rightsquigarrow N(0, \Sigma)$$

where

$$\Sigma = g \, E_\theta(Y Y^T) \, g^T,$$

$Y = (X, X^2, \ldots, X^k)^T$, $g = (g_1, \ldots, g_k)$ and
$g_j = \partial \alpha_j^{-1}(\theta) / \partial\theta$.

The last statement in the theorem above can be used to find standard errors
and confidence intervals. However, there is an easier way: the bootstrap. We
defer discussion of this until the end of the chapter.



9.3 Maximum Likelihood 

The most common method for estimating parameters in a parametric model is
the maximum likelihood method. Let $X_1, \ldots, X_n$ be IID with PDF $f(x;\theta)$.

9.7 Definition. The likelihood function is defined by

$$\mathcal{L}_n(\theta) = \prod_{i=1}^n f(X_i;\theta). \qquad (9.5)$$

The log-likelihood function is defined by $\ell_n(\theta) = \log \mathcal{L}_n(\theta)$.

The likelihood function is just the joint density of the data, except that we
treat it as a function of the parameter $\theta$. Thus,
$\mathcal{L}_n : \Theta \to [0, \infty)$. The likelihood function is not a density
function: in general, it is not true that $\mathcal{L}_n(\theta)$ integrates to 1
(with respect to $\theta$).

9.8 Definition. The maximum likelihood estimator (MLE), denoted by
$\hat\theta_n$, is the value of $\theta$ that maximizes $\mathcal{L}_n(\theta)$.







FIGURE 9.1. Likelihood function for Bernoulli with $n = 20$ and
$\sum_{i=1}^n X_i = 12$. The MLE is $\hat p_n = 12/20 = 0.6$.

The maximum of $\ell_n(\theta)$ occurs at the same place as the maximum of
$\mathcal{L}_n(\theta)$, so maximizing the log-likelihood leads to the same answer
as maximizing the likelihood. Often, it is easier to work with the log-likelihood.

9.9 Remark. If we multiply $\mathcal{L}_n(\theta)$ by any positive constant $c$
(not depending on $\theta$) then this will not change the MLE. Hence, we shall
often drop constants in the likelihood function.

9.10 Example. Suppose that $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. The
probability function is $f(x;p) = p^x (1-p)^{1-x}$ for $x = 0, 1$. The unknown
parameter is $p$. Then,

$$\mathcal{L}_n(p) = \prod_{i=1}^n f(X_i; p) = \prod_{i=1}^n p^{X_i} (1-p)^{1-X_i} = p^S (1-p)^{n-S}$$

where $S = \sum_i X_i$. Hence,

$$\ell_n(p) = S \log p + (n - S) \log(1 - p).$$

Take the derivative of $\ell_n(p)$ and set it equal to 0 to find that the MLE is
$\hat p_n = S/n$.

See Figure 9.1. ■
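The setting of Figure 9.1 ($n = 20$, $S = 12$) can be checked numerically. The sketch below evaluates the Bernoulli log-likelihood on a grid and confirms that the grid maximizer agrees with the closed form $S/n$; the grid search is only an illustration of our own, not how one would compute this MLE in practice.

```python
import math

def bernoulli_loglik(p, s, n):
    """l_n(p) = S log p + (n - S) log(1 - p), for 0 < p < 1."""
    return s * math.log(p) + (n - s) * math.log(1.0 - p)

n, s = 20, 12
grid = [k / 1000 for k in range(1, 1000)]           # p = 0.001, ..., 0.999
p_grid = max(grid, key=lambda p: bernoulli_loglik(p, s, n))
print(p_grid, s / n)
```

Because the log-likelihood is strictly concave in $p$, the grid maximum is the grid point closest to the true maximizer.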



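9.11 Example. Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$. The parameter is
$\theta = (\mu, \sigma)$ and the likelihood function (ignoring some constants) is:

$$\begin{aligned}
\mathcal{L}_n(\mu, \sigma) &= \prod_i \frac{1}{\sigma} \exp\left\{ -\frac{1}{2\sigma^2} (X_i - \mu)^2 \right\} \\
&= \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} \sum_i (X_i - \mu)^2 \right\} \\
&= \sigma^{-n} \exp\left\{ -\frac{n S^2}{2\sigma^2} \right\} \exp\left\{ -\frac{n (\overline{X} - \mu)^2}{2\sigma^2} \right\}
\end{aligned}$$

where $\overline{X} = n^{-1} \sum_i X_i$ is the sample mean and
$S^2 = n^{-1} \sum_i (X_i - \overline{X})^2$. The last equality above follows from
the fact that $\sum_i (X_i - \mu)^2 = n S^2 + n(\overline{X} - \mu)^2$, which can be
verified by writing $\sum_i (X_i - \mu)^2 = \sum_i (X_i - \overline{X} + \overline{X} - \mu)^2$
and then expanding the square. The log-likelihood is

$$\ell(\mu, \sigma) = -n \log \sigma - \frac{n S^2}{2\sigma^2} - \frac{n (\overline{X} - \mu)^2}{2\sigma^2}.$$

Solving the equations

$$\frac{\partial \ell(\mu, \sigma)}{\partial \mu} = 0 \quad \text{and} \quad \frac{\partial \ell(\mu, \sigma)}{\partial \sigma} = 0,$$

we conclude that $\hat\mu = \overline{X}$ and $\hat\sigma = S$. It can be verified
that these are indeed global maxima of the likelihood. ■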



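9.12 Example (A Hard Example). Here is an example that many people find
confusing. Let $X_1, \ldots, X_n \sim \text{Unif}(0, \theta)$. Recall that

$$f(x;\theta) = \begin{cases} 1/\theta & 0 \le x \le \theta \\ 0 & \text{otherwise.} \end{cases}$$

Consider a fixed value of $\theta$. Suppose $\theta < X_i$ for some $i$. Then,
$f(X_i;\theta) = 0$ and hence $\mathcal{L}_n(\theta) = \prod_i f(X_i;\theta) = 0$.
It follows that $\mathcal{L}_n(\theta) = 0$ if any $X_i > \theta$. Therefore,
$\mathcal{L}_n(\theta) = 0$ if $\theta < X_{(n)}$ where
$X_{(n)} = \max\{X_1, \ldots, X_n\}$. Now consider any $\theta \ge X_{(n)}$. For
every $X_i$ we then have that $f(X_i;\theta) = 1/\theta$ so that
$\mathcal{L}_n(\theta) = \prod_i f(X_i;\theta) = \theta^{-n}$. In conclusion,

$$\mathcal{L}_n(\theta) = \begin{cases} \left( \frac{1}{\theta} \right)^n & \theta \ge X_{(n)} \\ 0 & \theta < X_{(n)}. \end{cases}$$

See Figure 9.2. Now $\mathcal{L}_n(\theta)$ is strictly decreasing over the
interval $[X_{(n)}, \infty)$. Hence, $\hat\theta_n = X_{(n)}$. ■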



The maximum likelihood estimators for the multivariate Normal and the 
multinomial can be found in Theorems 14.5 and 14.3. 



9.4 Properties of Maximum Likelihood Estimators 

Under certain conditions on the model, the maximum likelihood estimator
$\hat\theta_n$ possesses many properties that make it an appealing choice of
estimator. The main properties of the MLE are:

1. The MLE is consistent: $\hat\theta_n \xrightarrow{P} \theta_*$ where
$\theta_*$ denotes the true value of the parameter $\theta$;

2. The MLE is equivariant: if $\hat\theta_n$ is the MLE of $\theta$ then
$g(\hat\theta_n)$ is the MLE of $g(\theta)$;

3. The MLE is asymptotically Normal:
$(\hat\theta - \theta_*)/\widehat{\rm se} \rightsquigarrow N(0, 1)$; also, the
estimated standard error $\widehat{\rm se}$ can often be computed analytically;

4. The MLE is asymptotically optimal or efficient: roughly, this means 
that among all well-behaved estimators, the MLE has the smallest vari- 
ance, at least for large samples; 

5. The MLE is approximately the Bayes estimator. (This point will be ex- 
plained later.) 



We will spend some time explaining what these properties mean and why 
they are good things. In sufficiently complicated problems, these properties 
will no longer hold and the MLE will no longer be a good estimator. For now 
we focus on the simpler situations where the MLE works well. The properties 
we discuss only hold if the model satisfies certain regularity conditions.
These are essentially smoothness conditions on $f(x;\theta)$. Unless otherwise
stated, we shall tacitly assume that these conditions hold.



9.5 Consistency of Maximum Likelihood Estimators 

Consistency means that the MLE converges in probability to the true value.
To proceed, we need a definition. If $f$ and $g$ are PDFs, define the
Kullback-Leibler distance² between $f$ and $g$ to be

$$D(f, g) = \int f(x) \log\left( \frac{f(x)}{g(x)} \right) dx. \qquad (9.6)$$

It can be shown that $D(f, g) \ge 0$ and $D(f, f) = 0$. For any
$\theta, \psi \in \Theta$ write $D(\theta, \psi)$ to mean
$D(f(x;\theta), f(x;\psi))$.

We will say that the model $\mathfrak{F}$ is identifiable if $\theta \ne \psi$
implies that $D(\theta, \psi) > 0$. This means that different values of the
parameter correspond to different distributions. We will assume from now on
that the model is identifiable.
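To make the Kullback-Leibler distance concrete, here is a small sketch for two Bernoulli distributions. Since the Bernoulli is discrete, the integral in (9.6) is replaced by a sum over $x \in \{0, 1\}$; that substitution and the function name are our own choices for this illustration.

```python
import math

def kl_bernoulli(p, q):
    """D(f, g) for f = Bernoulli(p) and g = Bernoulli(q), with 0 < p, q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

print(kl_bernoulli(0.3, 0.3))   # zero when the two distributions coincide
print(kl_bernoulli(0.3, 0.5))   # positive when they differ
print(kl_bernoulli(0.5, 0.3))   # a different value: D is not symmetric
```

The last two lines show why $D$ is not a distance in the formal sense, and the positivity for $p \ne q$ is exactly the identifiability condition for this model.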



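²This is not a distance in the formal sense because $D(f, g)$ is not symmetric.

Let $\theta_*$ denote the true value of $\theta$. Maximizing $\ell_n(\theta)$ is
equivalent to maximizing

$$M_n(\theta) = \frac{1}{n} \sum_i \log \frac{f(X_i;\theta)}{f(X_i;\theta_*)}.$$

This follows since $M_n(\theta) = n^{-1}(\ell_n(\theta) - \ell_n(\theta_*))$ and
$\ell_n(\theta_*)$ is a constant (with respect to $\theta$). By the law of large
numbers, $M_n(\theta)$ converges to

$$\begin{aligned}
E_{\theta_*}\left( \log \frac{f(X_i;\theta)}{f(X_i;\theta_*)} \right)
&= \int \log\left( \frac{f(x;\theta)}{f(x;\theta_*)} \right) f(x;\theta_*) \, dx \\
&= -\int \log\left( \frac{f(x;\theta_*)}{f(x;\theta)} \right) f(x;\theta_*) \, dx \\
&= -D(\theta_*, \theta).
\end{aligned}$$

Hence, $M_n(\theta) \approx -D(\theta_*, \theta)$ which is maximized at $\theta_*$
since $-D(\theta_*, \theta_*) = 0$ and $-D(\theta_*, \theta) < 0$ for
$\theta \ne \theta_*$. Therefore, we expect that the maximizer will tend to
$\theta_*$. To prove this formally, we need more than
$M_n(\theta) \xrightarrow{P} -D(\theta_*, \theta)$. We need this convergence to be
uniform over $\theta$. We also have to make sure that the function
$D(\theta_*, \theta)$ is well behaved. Here are the formal details.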



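9.13 Theorem. Let $\theta_*$ denote the true value of $\theta$. Define

$$M_n(\theta) = \frac{1}{n} \sum_i \log \frac{f(X_i;\theta)}{f(X_i;\theta_*)}$$

and $M(\theta) = -D(\theta_*, \theta)$. Suppose that

$$\sup_{\theta \in \Theta} |M_n(\theta) - M(\theta)| \xrightarrow{P} 0 \qquad (9.7)$$

and that, for every $\epsilon > 0$,

$$\sup_{\theta : |\theta - \theta_*| \ge \epsilon} M(\theta) < M(\theta_*). \qquad (9.8)$$

Let $\hat\theta_n$ denote the MLE. Then $\hat\theta_n \xrightarrow{P} \theta_*$.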



The proof is in the appendix. 



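9.6 Equivariance of the MLE

9.14 Theorem. Let $\tau = g(\theta)$ be a function of $\theta$. Let
$\hat\theta_n$ be the MLE of $\theta$. Then $\hat\tau_n = g(\hat\theta_n)$ is the
MLE of $\tau$.

Proof. Let $h = g^{-1}$ denote the inverse of $g$. Then
$\hat\theta_n = h(\hat\tau_n)$. For any $\tau$,
$\mathcal{L}(\tau) = \prod_i f(x_i; h(\tau)) = \prod_i f(x_i; \theta) = \mathcal{L}(\theta)$
where $\theta = h(\tau)$. Hence, for any $\tau$,

$$\mathcal{L}_n(\tau) = \mathcal{L}(\theta) \le \mathcal{L}(\hat\theta) = \mathcal{L}_n(\hat\tau). \quad ■$$

9.15 Example. Let $X_1, \ldots, X_n \sim N(\theta, 1)$. The MLE for $\theta$ is
$\hat\theta_n = \overline{X}_n$. Let $\tau = e^{\theta}$. Then, the MLE for $\tau$
is $\hat\tau = e^{\hat\theta} = e^{\overline{X}}$. ■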



9.7 Asymptotic Normality 

It turns out that the distribution of $\hat\theta_n$ is approximately Normal and
we can compute its approximate variance analytically. To explore this, we first
need a few definitions.



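9.16 Definition. The score function is defined to be

$$s(X;\theta) = \frac{\partial \log f(X;\theta)}{\partial \theta}.$$

The Fisher information is defined to be

$$I_n(\theta) = V_\theta\left( \sum_{i=1}^n s(X_i;\theta) \right) = \sum_{i=1}^n V_\theta\left( s(X_i;\theta) \right).$$

For $n = 1$ we will sometimes write $I(\theta)$ instead of $I_1(\theta)$. It can
be shown that $E_\theta(s(X;\theta)) = 0$. It then follows that
$V_\theta(s(X;\theta)) = E_\theta(s^2(X;\theta))$. In fact, a further
simplification of $I_n(\theta)$ is given in the next result.

9.17 Theorem. $I_n(\theta) = n I(\theta)$. Also,

$$I(\theta) = -E_\theta\left( \frac{\partial^2 \log f(X;\theta)}{\partial \theta^2} \right)
= -\int \left( \frac{\partial^2 \log f(x;\theta)}{\partial \theta^2} \right) f(x;\theta) \, dx. \qquad (9.11)$$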







9.18 Theorem (Asymptotic Normality of the MLE). Let
${\rm se} = \sqrt{V(\hat\theta_n)}$. Under appropriate regularity conditions, the
following hold:

1. ${\rm se} \approx \sqrt{1/I_n(\theta)}$ and

$$\frac{(\hat\theta_n - \theta)}{{\rm se}} \rightsquigarrow N(0, 1).$$

2. Let $\widehat{\rm se} = \sqrt{1/I_n(\hat\theta_n)}$. Then,

$$\frac{(\hat\theta_n - \theta)}{\widehat{\rm se}} \rightsquigarrow N(0, 1).$$

The proof is in the appendix. The first statement says that
$\hat\theta_n \approx N(\theta, {\rm se}^2)$ where the approximate standard error
of $\hat\theta_n$ is ${\rm se} = \sqrt{1/I_n(\theta)}$. The second statement says
that this is still true even if we replace the standard error by its estimated
standard error $\widehat{\rm se} = \sqrt{1/I_n(\hat\theta_n)}$.

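Informally, the theorem says that the distribution of the MLE can be
approximated with $N(\theta, \widehat{\rm se}^2)$. From this fact we can
construct an (asymptotic) confidence interval.

9.19 Theorem. Let

$$C_n = \left( \hat\theta_n - z_{\alpha/2} \, \widehat{\rm se}, \; \hat\theta_n + z_{\alpha/2} \, \widehat{\rm se} \right).$$

Then, $P_\theta(\theta \in C_n) \to 1 - \alpha$ as $n \to \infty$.

Proof. Let $Z$ denote a standard normal random variable. Then,

$$\begin{aligned}
P_\theta(\theta \in C_n) &= P_\theta\left( \hat\theta_n - z_{\alpha/2} \, \widehat{\rm se} \le \theta \le \hat\theta_n + z_{\alpha/2} \, \widehat{\rm se} \right) \\
&= P_\theta\left( -z_{\alpha/2} \le \frac{\hat\theta_n - \theta}{\widehat{\rm se}} \le z_{\alpha/2} \right) \\
&\to P(-z_{\alpha/2} < Z < z_{\alpha/2}) = 1 - \alpha. \quad ■
\end{aligned}$$

For $\alpha = .05$, $z_{\alpha/2} = 1.96 \approx 2$, so:

$$\hat\theta_n \pm 2 \, \widehat{\rm se}$$

is an approximate 95 percent confidence interval.

When you read an opinion poll in the newspaper, you often see a statement
like: the poll is accurate to within one point, 95 percent of the time. They are
simply giving a 95 percent confidence interval of the form
$\hat\theta_n \pm 2 \, \widehat{\rm se}$.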



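9.20 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. The MLE is
$\hat p_n = \sum_i X_i / n$ and $f(x;p) = p^x (1-p)^{1-x}$,
$\log f(x;p) = x \log p + (1 - x) \log(1 - p)$,

$$s(X;p) = \frac{X}{p} - \frac{1 - X}{1 - p},$$

and

$$-s'(X;p) = \frac{X}{p^2} + \frac{1 - X}{(1 - p)^2}.$$

Thus,

$$I(p) = E_p(-s'(X;p)) = \frac{p}{p^2} + \frac{(1 - p)}{(1 - p)^2} = \frac{1}{p(1 - p)}.$$

Hence,

$$\widehat{\rm se} = \frac{1}{\sqrt{I_n(\hat p_n)}} = \frac{1}{\sqrt{n I(\hat p_n)}} = \left\{ \frac{\hat p (1 - \hat p)}{n} \right\}^{1/2}.$$

An approximate 95 percent confidence interval is

$$\hat p_n \pm 2 \left\{ \frac{\hat p_n (1 - \hat p_n)}{n} \right\}^{1/2}. \quad ■$$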
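The interval in Example 9.20 is easy to compute. The sketch below (function name ours) returns the MLE, the estimated standard error $\{\hat p(1-\hat p)/n\}^{1/2}$, and the approximate interval $\hat p_n \pm z\,\widehat{\rm se}$, with $z = 2$ as in the text.

```python
import math

def bernoulli_mle_interval(s, n, z=2.0):
    """MLE, estimated se, and approximate interval for a Bernoulli proportion."""
    p_hat = s / n
    se_hat = math.sqrt(p_hat * (1.0 - p_hat) / n)   # 1 / sqrt(n * I(p_hat))
    return p_hat, se_hat, (p_hat - z * se_hat, p_hat + z * se_hat)

p_hat, se_hat, interval = bernoulli_mle_interval(s=12, n=20)
print(p_hat, se_hat, interval)
```

With such a small $n$ the Normal approximation is rough, so the interval should be read as approximate; it also degenerates to a single point when $S = 0$ or $S = n$.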



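9.21 Example. Let $X_1, \ldots, X_n \sim N(\theta, \sigma^2)$ where $\sigma^2$ is
known. The score function is $s(X;\theta) = (X - \theta)/\sigma^2$ and
$s'(X;\theta) = -1/\sigma^2$ so that $I_1(\theta) = 1/\sigma^2$. The MLE is
$\hat\theta_n = \overline{X}_n$. According to Theorem 9.18,
$\overline{X}_n \approx N(\theta, \sigma^2/n)$. In this case, the Normal
approximation is actually exact. ■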



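9.22 Example. Let $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$. Then
$\hat\lambda_n = \overline{X}_n$ and some calculations show that
$I_1(\lambda) = 1/\lambda$, so

$$\widehat{\rm se} = \frac{1}{\sqrt{n I(\hat\lambda_n)}} = \sqrt{\frac{\hat\lambda_n}{n}}.$$

Therefore, an approximate $1 - \alpha$ confidence interval for $\lambda$ is
$\hat\lambda_n \pm z_{\alpha/2} \sqrt{\hat\lambda_n / n}$. ■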



9.8 Optimality 

Suppose that $X_1, \ldots, X_n \sim N(\theta, \sigma^2)$. The MLE is
$\hat\theta_n = \overline{X}_n$. Another reasonable estimator of $\theta$ is the
sample median $\tilde\theta_n$. The MLE satisfies

$$\sqrt{n}(\hat\theta_n - \theta) \rightsquigarrow N(0, \sigma^2).$$

It can be proved that the median satisfies

$$\sqrt{n}(\tilde\theta_n - \theta) \rightsquigarrow N\left( 0, \sigma^2 \frac{\pi}{2} \right).$$

This means that the median converges to the right value but has a larger
variance than the MLE.

More generally, consider two estimators $T_n$ and $U_n$ and suppose that

$$\sqrt{n}(T_n - \theta) \rightsquigarrow N(0, t^2)$$

and that

$$\sqrt{n}(U_n - \theta) \rightsquigarrow N(0, u^2).$$

We define the asymptotic relative efficiency of $U$ to $T$ by
${\rm ARE}(U, T) = t^2/u^2$. In the Normal example,
${\rm ARE}(\tilde\theta_n, \hat\theta_n) = 2/\pi \approx .64$. The interpretation
is that if you use the median, you are effectively using only a fraction of the
data.

9.23 Theorem. If $\hat\theta_n$ is the MLE and $\tilde\theta_n$ is any other
estimator then³

$${\rm ARE}(\tilde\theta_n, \hat\theta_n) \le 1.$$

Thus, the MLE has the smallest (asymptotic) variance and we say that the
MLE is efficient or asymptotically optimal.

This result is predicated upon the assumed model being correct. If the model 
is wrong, the MLE may no longer be optimal. We will discuss optimality in 
more generality when we discuss decision theory in Chapter 12. 
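The efficiency comparison between the mean and the median in the Normal example can be checked by simulation. The sketch below is an illustration of our own (sample size, number of replications, and seed are arbitrary choices): it estimates the ratio of the variance of the sample mean to the variance of the sample median, which should be in the neighborhood of $2/\pi$ for moderately large $n$.

```python
import random
import statistics

def simulate_are_median_vs_mean(n=101, reps=2000, seed=0):
    """Estimate ARE(median, mean) = Var(mean) / Var(median) for N(0, 1) samples."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(xs))
        medians.append(statistics.median(xs))
    return statistics.pvariance(means) / statistics.pvariance(medians)

are = simulate_are_median_vs_mean()
print(are)   # compare with 2 / pi, roughly 0.64
```

The simulated ratio fluctuates from run to run and the asymptotic value is only approached as $n$ grows, so agreement to one or two decimal places is all one should expect.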



9.9 The Delta Method 



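Let $\tau = g(\theta)$ where $g$ is a smooth function. The maximum likelihood
estimator of $\tau$ is $\hat\tau = g(\hat\theta)$. Now we address the following
question: what is the distribution of $\hat\tau$?

9.24 Theorem (The Delta Method). If $\tau = g(\theta)$ where $g$ is
differentiable and $g'(\theta) \ne 0$ then

$$\frac{(\hat\tau_n - \tau)}{\widehat{\rm se}(\hat\tau)} \rightsquigarrow N(0, 1) \qquad (9.15)$$

where $\hat\tau_n = g(\hat\theta_n)$ and

$$\widehat{\rm se}(\hat\tau_n) = |g'(\hat\theta)| \, \widehat{\rm se}(\hat\theta_n). \qquad (9.16)$$

Hence, if

$$C_n = \left( \hat\tau_n - z_{\alpha/2} \, \widehat{\rm se}(\hat\tau_n), \; \hat\tau_n + z_{\alpha/2} \, \widehat{\rm se}(\hat\tau_n) \right) \qquad (9.17)$$

then $P_\theta(\tau \in C_n) \to 1 - \alpha$ as $n \to \infty$.

³The result is actually more subtle than this but the details are too
complicated to consider here.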



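9.25 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ and let
$\psi = g(p) = \log(p/(1 - p))$. The Fisher information function is
$I(p) = 1/(p(1 - p))$ so the estimated standard error of the MLE $\hat p_n$ is

$$\widehat{\rm se} = \sqrt{\frac{\hat p_n (1 - \hat p_n)}{n}}.$$

The MLE of $\psi$ is $\hat\psi = \log(\hat p/(1 - \hat p))$. Since
$g'(p) = 1/(p(1 - p))$, according to the delta method

$$\widehat{\rm se}(\hat\psi_n) = |g'(\hat p_n)| \, \widehat{\rm se}(\hat p_n) = \frac{1}{\sqrt{n \hat p_n (1 - \hat p_n)}}.$$

An approximate 95 percent confidence interval is

$$\hat\psi_n \pm \frac{2}{\sqrt{n \hat p_n (1 - \hat p_n)}}. \quad ■$$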
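The delta-method calculation of Example 9.25 can be written out directly. The following sketch (function name ours) computes $\hat\psi$, its standard error $1/\sqrt{n\hat p_n(1-\hat p_n)}$, and the approximate interval, using $z = 2$ as in the text.

```python
import math

def log_odds_interval(s, n, z=2.0):
    """MLE of psi = log(p / (1 - p)), its delta-method se, and an approximate interval."""
    p_hat = s / n
    psi_hat = math.log(p_hat / (1.0 - p_hat))
    se_p = math.sqrt(p_hat * (1.0 - p_hat) / n)     # se of p_hat
    g_prime = 1.0 / (p_hat * (1.0 - p_hat))         # g'(p) evaluated at p_hat
    se_psi = abs(g_prime) * se_p                    # equals 1 / sqrt(n p_hat (1 - p_hat))
    return psi_hat, se_psi, (psi_hat - z * se_psi, psi_hat + z * se_psi)

psi_hat, se_psi, interval = log_odds_interval(s=12, n=20)
print(psi_hat, se_psi, interval)
```

Keeping the factors $|g'(\hat p_n)|$ and $\widehat{\rm se}(\hat p_n)$ separate in the code mirrors (9.16) and makes it easy to swap in a different transformation $g$.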

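9.26 Example. Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$. Suppose that $\mu$ is
known, $\sigma$ is unknown and that we want to estimate $\psi = \log \sigma$. The
log-likelihood is
$\ell(\sigma) = -n \log \sigma - \frac{1}{2\sigma^2} \sum_i (x_i - \mu)^2$.
Differentiate and set equal to 0 and conclude that

$$\hat\sigma_n = \sqrt{\frac{\sum_i (X_i - \mu)^2}{n}}.$$

To get the standard error we need the Fisher information. First,

$$\log f(X;\sigma) = -\log \sigma - \frac{(X - \mu)^2}{2\sigma^2}$$

with second derivative

$$\frac{1}{\sigma^2} - \frac{3 (X - \mu)^2}{\sigma^4},$$

and hence

$$I(\sigma) = -\frac{1}{\sigma^2} + \frac{3\sigma^2}{\sigma^4} = \frac{2}{\sigma^2}.$$

Therefore, $\widehat{\rm se} = \hat\sigma_n / \sqrt{2n}$. Let
$\psi = g(\sigma) = \log \sigma$. Then, $\hat\psi_n = \log \hat\sigma_n$. Since
$g' = 1/\sigma$,

$$\widehat{\rm se}(\hat\psi_n) = \frac{1}{\hat\sigma_n} \frac{\hat\sigma_n}{\sqrt{2n}} = \frac{1}{\sqrt{2n}},$$

and an approximate 95 percent confidence interval is
$\hat\psi_n \pm 2/\sqrt{2n}$. ■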



9.10 Multiparameter Models 



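These ideas can directly be extended to models with several parameters. Let
$\theta = (\theta_1, \ldots, \theta_k)$ and let
$\hat\theta = (\hat\theta_1, \ldots, \hat\theta_k)$ be the MLE. Let
$\ell_n = \sum_{i=1}^n \log f(X_i;\theta)$,

$$H_{jj} = \frac{\partial^2 \ell_n}{\partial \theta_j^2} \quad \text{and} \quad H_{jk} = \frac{\partial^2 \ell_n}{\partial \theta_j \, \partial \theta_k}.$$

Define the Fisher Information Matrix by

$$I_n(\theta) = -\begin{bmatrix}
E_\theta(H_{11}) & E_\theta(H_{12}) & \cdots & E_\theta(H_{1k}) \\
E_\theta(H_{21}) & E_\theta(H_{22}) & \cdots & E_\theta(H_{2k}) \\
\vdots & \vdots & \ddots & \vdots \\
E_\theta(H_{k1}) & E_\theta(H_{k2}) & \cdots & E_\theta(H_{kk})
\end{bmatrix}. \qquad (9.18)$$

Let $J_n(\theta) = I_n^{-1}(\theta)$ be the inverse of $I_n$.

9.27 Theorem. Under appropriate regularity conditions,

$$(\hat\theta - \theta) \approx N(0, J_n).$$

Also, if $\hat\theta_j$ is the $j^{\rm th}$ component of $\hat\theta$, then

$$\frac{(\hat\theta_j - \theta_j)}{\widehat{\rm se}_j} \rightsquigarrow N(0, 1) \qquad (9.19)$$

where $\widehat{\rm se}_j^2 = J_n(j, j)$ is the $j^{\rm th}$ diagonal element of
$J_n$. The approximate covariance of $\hat\theta_j$ and $\hat\theta_k$ is
${\rm Cov}(\hat\theta_j, \hat\theta_k) \approx J_n(j, k)$.

There is also a multiparameter delta method. Let
$\tau = g(\theta_1, \ldots, \theta_k)$ be a function and let

$$\nabla g = \begin{pmatrix} \dfrac{\partial g}{\partial \theta_1} \\ \vdots \\ \dfrac{\partial g}{\partial \theta_k} \end{pmatrix}$$

be the gradient of $g$.

9.28 Theorem (Multiparameter delta method). Suppose that $\nabla g$ evaluated at
$\hat\theta$ is not 0. Let $\hat\tau = g(\hat\theta)$. Then

$$\frac{(\hat\tau - \tau)}{\widehat{\rm se}(\hat\tau)} \rightsquigarrow N(0, 1)$$

where

$$\widehat{\rm se}(\hat\tau) = \sqrt{(\hat\nabla g)^T \hat J_n (\hat\nabla g)}, \qquad (9.20)$$

$\hat J_n = J_n(\hat\theta_n)$ and $\hat\nabla g$ is $\nabla g$ evaluated at
$\theta = \hat\theta$.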



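9.29 Example. Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$. Let
$\tau = g(\mu, \sigma) = \sigma/\mu$. In Exercise 8 you will show that

$$I_n(\mu, \sigma) = \begin{bmatrix} \dfrac{n}{\sigma^2} & 0 \\ 0 & \dfrac{2n}{\sigma^2} \end{bmatrix}.$$

Hence,

$$J_n = I_n^{-1}(\mu, \sigma) = \frac{1}{n} \begin{bmatrix} \sigma^2 & 0 \\ 0 & \dfrac{\sigma^2}{2} \end{bmatrix}.$$

The gradient of $g$ is

$$\nabla g = \begin{pmatrix} -\dfrac{\sigma}{\mu^2} \\ \dfrac{1}{\mu} \end{pmatrix}.$$

Thus,

$$\widehat{\rm se}(\hat\tau) = \sqrt{(\hat\nabla g)^T \hat J_n (\hat\nabla g)} = \frac{1}{\sqrt{n}} \sqrt{\frac{\hat\sigma^4}{\hat\mu^4} + \frac{\hat\sigma^2}{2 \hat\mu^2}}. \quad ■$$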



9.11 The Parametric Bootstrap 

For parametric models, standard errors and confidence intervals may also be
estimated using the bootstrap. There is only one change. In the nonparametric
bootstrap, we sampled $X_1^*, \ldots, X_n^*$ from the empirical distribution
$\hat F_n$. In the parametric bootstrap we sample instead from
$f(x; \hat\theta_n)$. Here, $\hat\theta_n$ could be the MLE or the method of
moments estimator.

9.30 Example. Consider Example 9.29. To get the bootstrap standard error,
simulate $X_1^*, \ldots, X_n^* \sim N(\hat\mu, \hat\sigma^2)$, compute
$\hat\mu^* = n^{-1} \sum_i X_i^*$ and
$\hat\sigma^{2*} = n^{-1} \sum_i (X_i^* - \hat\mu^*)^2$. Then compute
$\hat\tau^* = g(\hat\mu^*, \hat\sigma^*) = \hat\sigma^*/\hat\mu^*$. Repeating this
$B$ times yields bootstrap replications

$$\hat\tau_1^*, \ldots, \hat\tau_B^*$$

and the estimated standard error is

$$\widehat{\rm se}_{\rm boot} = \sqrt{\frac{\sum_{b=1}^B (\hat\tau_b^* - \hat\tau)^2}{B}}. \quad ■$$




The bootstrap is much easier than the delta method. On the other hand, 
the delta method has the advantage that it gives a closed form expression for 
the standard error. 
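To compare the two approaches on the ratio $\tau = \sigma/\mu$, the sketch below computes both the delta-method standard error, obtained by multiplying out $(\nabla g)^T J_n (\nabla g)$ for the Normal model, and the parametric bootstrap standard error. The function names, the simulated data set ($\mu = 5$, $\sigma = 1$, $n = 100$), the number of replications $B$, and the seeds are assumptions made for this illustration.

```python
import math
import random

def mle_normal(xs):
    """MLE (mu_hat, sigma_hat) for a Normal sample; sigma_hat uses divisor n."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return mu, sigma

def delta_se(mu, sigma, n):
    """sqrt(grad^T J_n grad) with grad = (-sigma/mu^2, 1/mu), J_n = diag(sigma^2, sigma^2/2)/n."""
    return math.sqrt((sigma ** 4 / mu ** 4 + sigma ** 2 / (2.0 * mu ** 2)) / n)

def parametric_bootstrap_se(mu, sigma, n, B=2000, seed=1):
    """Bootstrap se of tau_hat = sigma_hat / mu_hat, sampling from N(mu, sigma^2)."""
    rng = random.Random(seed)
    tau_hat = sigma / mu
    total = 0.0
    for _ in range(B):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        mu_star, sigma_star = mle_normal(xs)
        total += (sigma_star / mu_star - tau_hat) ** 2
    return math.sqrt(total / B)

rng = random.Random(0)
data = [rng.gauss(5.0, 1.0) for _ in range(100)]
mu_hat, sigma_hat = mle_normal(data)
se_delta = delta_se(mu_hat, sigma_hat, len(data))
se_boot = parametric_bootstrap_se(mu_hat, sigma_hat, len(data))
print(se_delta, se_boot)
```

The two numbers should be close but not identical: the delta method is a first-order approximation, while the bootstrap value carries Monte Carlo error that shrinks as $B$ grows.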



9.12 Checking Assumptions 

If we assume the data come from a parametric model, then it is a good idea to 
check that assumption. One possibility is to check the assumptions informally 
by inspecting plots of the data. For example, if a histogram of the data looks 
very bimodal, then the assumption of Normality might be questionable. A 
formal way to test a parametric model is to use a goodness-of-fit test. See 
Section 10.8. 



9.13 Appendix 

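9.13.1 Proofs

Proof of Theorem 9.13. Since $\hat\theta_n$ maximizes $M_n(\theta)$, we have
$M_n(\hat\theta_n) \ge M_n(\theta_*)$. Hence,

$$\begin{aligned}
M(\theta_*) - M(\hat\theta_n) &= M_n(\theta_*) - M(\hat\theta_n) + M(\theta_*) - M_n(\theta_*) \\
&\le M_n(\hat\theta_n) - M(\hat\theta_n) + M(\theta_*) - M_n(\theta_*) \\
&\le \sup_\theta |M_n(\theta) - M(\theta)| + M(\theta_*) - M_n(\theta_*) \\
&\xrightarrow{P} 0
\end{aligned}$$

where the last line follows from (9.7). It follows that, for any $\delta > 0$,

$$P\left( M(\hat\theta_n) < M(\theta_*) - \delta \right) \to 0.$$

Pick any $\epsilon > 0$. By (9.8), there exists $\delta > 0$ such that
$|\theta - \theta_*| \ge \epsilon$ implies that $M(\theta) < M(\theta_*) - \delta$.
Hence,

$$P(|\hat\theta_n - \theta_*| > \epsilon) \le P\left( M(\hat\theta_n) < M(\theta_*) - \delta \right) \to 0. \quad ■$$

Next we want to prove Theorem 9.18. First we need a lemma.

9.31 Lemma. The score function satisfies

$$E_\theta[s(X;\theta)] = 0.$$

Proof. Note that $1 = \int f(x;\theta) \, dx$. Differentiate both sides of this
equation to conclude that

$$\begin{aligned}
0 &= \frac{\partial}{\partial \theta} \int f(x;\theta) \, dx = \int \frac{\partial}{\partial \theta} f(x;\theta) \, dx \\
&= \int \frac{\partial f(x;\theta) / \partial \theta}{f(x;\theta)} f(x;\theta) \, dx = \int \frac{\partial \log f(x;\theta)}{\partial \theta} f(x;\theta) \, dx \\
&= \int s(x;\theta) f(x;\theta) \, dx = E_\theta \, s(X;\theta). \quad ■
\end{aligned}$$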

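Proof of Theorem 9.18. Let $\ell(\theta) = \log \mathcal{L}(\theta)$. Then,

$$0 = \ell'(\hat\theta) \approx \ell'(\theta) + (\hat\theta - \theta) \ell''(\theta).$$

Rearrange the above equation to get
$\hat\theta - \theta = -\ell'(\theta)/\ell''(\theta)$ or, in other words,

$$\sqrt{n}(\hat\theta - \theta) = \frac{\frac{1}{\sqrt{n}} \ell'(\theta)}{-\frac{1}{n} \ell''(\theta)} \equiv \frac{\text{TOP}}{\text{BOTTOM}}.$$

Let $Y_i = \partial \log f(X_i;\theta)/\partial \theta$. Recall that $E(Y_i) = 0$
from the previous lemma and also $V(Y_i) = I(\theta)$. Hence,

$$\text{TOP} = n^{-1/2} \sum_i Y_i = \sqrt{n} \, \overline{Y} = \sqrt{n}(\overline{Y} - 0) \rightsquigarrow W \sim N(0, I(\theta))$$

by the central limit theorem. Let
$A_i = -\partial^2 \log f(X_i;\theta)/\partial \theta^2$. Then
$E(A_i) = I(\theta)$ and

$$\text{BOTTOM} = \overline{A} \xrightarrow{P} I(\theta)$$

by the law of large numbers. Apply Theorem 5.5 part (e), to conclude that

$$\sqrt{n}(\hat\theta - \theta) \rightsquigarrow \frac{W}{I(\theta)} \overset{d}{=} N\left( 0, \frac{1}{I(\theta)} \right).$$

Assuming that $I(\theta)$ is a continuous function of $\theta$, it follows that
$I(\hat\theta_n) \xrightarrow{P} I(\theta)$. Now

$$\frac{\hat\theta_n - \theta}{\widehat{\rm se}} = \sqrt{n} \, I^{1/2}(\hat\theta_n)(\hat\theta_n - \theta)
= \left\{ \sqrt{n} \, I^{1/2}(\theta)(\hat\theta_n - \theta) \right\} \sqrt{\frac{I(\hat\theta_n)}{I(\theta)}}.$$

The first term tends in distribution to N(0,1). The second term tends in
probability to 1. The result follows from Theorem 5.5 part (e). ■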

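Outline of Proof of Theorem 9.24. Write

$$\hat\tau_n = g(\hat\theta_n) \approx g(\theta) + (\hat\theta_n - \theta) g'(\theta) = \tau + (\hat\theta_n - \theta) g'(\theta).$$

Thus,

$$\sqrt{n}(\hat\tau_n - \tau) \approx \sqrt{n}(\hat\theta_n - \theta) g'(\theta),$$

and hence

$$\frac{\sqrt{n I(\theta)} \, (\hat\tau_n - \tau)}{g'(\theta)} \approx \sqrt{n I(\theta)} \, (\hat\theta_n - \theta).$$

Theorem 9.18 tells us that the right-hand side tends in distribution to a
N(0,1). Hence,

$$\frac{\sqrt{n I(\theta)} \, (\hat\tau_n - \tau)}{g'(\theta)} \rightsquigarrow N(0, 1)$$

or, in other words,

$$\hat\tau_n \approx N\left( \tau, {\rm se}^2(\hat\tau_n) \right)$$

where

$${\rm se}^2(\hat\tau_n) = \frac{(g'(\theta))^2}{n I(\theta)}.$$

The result remains true if we substitute $\hat\theta_n$ for $\theta$ by
Theorem 5.5 part (e). ■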



9.13.2 Sufficiency 

A statistic is a function $T(X^n)$ of the data. A sufficient statistic is a
statistic that contains all the information in the data. To make this more
formal, we need some definitions.

9.32 Definition. Write $x^n \leftrightarrow y^n$ if
$f(x^n;\theta) = c \, f(y^n;\theta)$ for some constant $c$ that might depend on
$x^n$ and $y^n$ but not $\theta$. A statistic $T(x^n)$ is sufficient if
$T(x^n) = T(y^n)$ implies that $x^n \leftrightarrow y^n$.



Notice that if $x^n \leftrightarrow y^n$ then the likelihood function based on
$x^n$ has the same shape as the likelihood function based on $y^n$. Roughly
speaking, a statistic is sufficient if we can calculate the likelihood function
knowing only $T(X^n)$.

9.33 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. Then
$\mathcal{L}(p) = p^S (1 - p)^{n - S}$ where $S = \sum_i X_i$, so $S$ is
sufficient. ■

9.34 Example. Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ and let
$T = (\overline{X}, S)$. Then

$$f(X^n; \mu, \sigma) = \left( \frac{1}{\sigma \sqrt{2\pi}} \right)^n \exp\left\{ -\frac{n S^2}{2\sigma^2} \right\} \exp\left\{ -\frac{n (\overline{X} - \mu)^2}{2\sigma^2} \right\}$$

where $S^2$ is the sample variance. The last expression depends on the data
only through $T$ and therefore, $T = (\overline{X}, S)$ is a sufficient statistic.
Note that $U = (17\overline{X}, S)$ is also a sufficient statistic. If I tell you
the value of $U$ then you can easily figure out $T$ and then compute the
likelihood. Sufficient statistics are far from unique. Consider the following
statistics for the $N(\mu, \sigma^2)$ model:

$$\begin{aligned}
T_1(X^n) &= (X_1, \ldots, X_n) \\
T_2(X^n) &= (\overline{X}, S) \\
T_3(X^n) &= \overline{X} \\
T_4(X^n) &= (\overline{X}, S, X_3).
\end{aligned}$$

The first statistic is just the whole data set. This is sufficient. The second
is also sufficient as we proved above. The third is not sufficient: you can't
compute $\ell(\mu, \sigma)$ if I only tell you $\overline{X}$. The fourth
statistic $T_4$ is sufficient. The statistics $T_1$ and $T_4$ are sufficient but
they contain redundant information. Intuitively, there is a sense in which $T_2$
is a "more concise" sufficient statistic than either $T_1$ or $T_4$. We can
express this formally by noting that $T_2$ is a function of $T_1$ and similarly,
$T_2$ is a function of $T_4$. For example, $T_2 = g(T_4)$ where
$g(a_1, a_2, a_3) = (a_1, a_2)$. ■



9.35 Definition. A statistic $T$ is minimal sufficient if (i) it is
sufficient; and (ii) it is a function of every other sufficient statistic.

9.36 Theorem. $T$ is minimal sufficient if the following is true:

$$T(x^n) = T(y^n) \text{ if and only if } x^n \leftrightarrow y^n.$$

A statistic induces a partition on the set of outcomes. We can think of 
sufficiency in terms of these partitions. 

9.37 Example. Let $X_1, X_2 \sim \text{Bernoulli}(\theta)$. Let $V = X_1$,
$T = \sum_i X_i$ and $U = (T, X_1)$. Here is the set of outcomes and the
statistics:

    X1   X2   V   T   U
    0    0    0   0   (0,0)
    0    1    0   1   (1,0)
    1    0    1   1   (1,1)
    1    1    1   2   (2,1)

The partitions induced by these statistics are:

    V  →  {(0,0), (0,1)}, {(1,0), (1,1)}
    T  →  {(0,0)}, {(0,1), (1,0)}, {(1,1)}
    U  →  {(0,0)}, {(0,1)}, {(1,0)}, {(1,1)}.

Then $V$ is not sufficient but $T$ and $U$ are sufficient. $T$ is minimal
sufficient; $U$ is not minimal since if $x^n = (1,0)$ and $y^n = (0,1)$, then
$x^n \leftrightarrow y^n$ yet $U(x^n) \ne U(y^n)$. The statistic $W = 17T$
generates the same partition as $T$. It is also minimal sufficient. ■

9.38 Example. For a $N(\mu, \sigma^2)$ model, $T = (\overline{X}, S)$ is a minimal
sufficient statistic. For the Bernoulli model, $T = \sum_i X_i$ is a minimal
sufficient statistic. For the Poisson model, $T = \sum_i X_i$ is a minimal
sufficient statistic. Check that $T = (\sum_i X_i, X_1)$ is sufficient but not
minimal sufficient. Check that $T = X_1$ is not sufficient. ■

I did not give the usual definition of sufficiency. The usual definition is this:
$T$ is sufficient if the distribution of $X^n$ given $T(X^n) = t$ does not depend
on $\theta$. In other words, $T$ is sufficient if
$f(x_1, \ldots, x_n \mid t; \theta) = h(x_1, \ldots, x_n, t)$ where $h$ is some
function that does not depend on $\theta$.

9.39 Example. Two coin flips. Let $X = (X_1, X_2) \sim \text{Bernoulli}(p)$. Then
$T = X_1 + X_2$ is sufficient. To see this, we need the distribution of
$(X_1, X_2)$ given $T = t$. Since $T$ can take 3 possible values, there are 3
conditional distributions to check. They are: (i) the distribution of
$(X_1, X_2)$ given $T = 0$:

    P(X1 = 0, X2 = 0 | t = 0) = 1,   P(X1 = 0, X2 = 1 | t = 0) = 0,
    P(X1 = 1, X2 = 0 | t = 0) = 0,   P(X1 = 1, X2 = 1 | t = 0) = 0;

(ii) the distribution of $(X_1, X_2)$ given $T = 1$:

    P(X1 = 0, X2 = 0 | t = 1) = 0,     P(X1 = 0, X2 = 1 | t = 1) = 1/2,
    P(X1 = 1, X2 = 0 | t = 1) = 1/2,   P(X1 = 1, X2 = 1 | t = 1) = 0;

and (iii) the distribution of $(X_1, X_2)$ given $T = 2$:

    P(X1 = 0, X2 = 0 | t = 2) = 0,   P(X1 = 0, X2 = 1 | t = 2) = 0,
    P(X1 = 1, X2 = 0 | t = 2) = 0,   P(X1 = 1, X2 = 1 | t = 2) = 1.

None of these depend on the parameter $p$. Thus, the distribution of
$X_1, X_2 \mid T$ does not depend on $p$, so $T$ is sufficient. ■

9.40 Theorem (Factorization Theorem). $T$ is sufficient if and only if there are
functions $g(t, \theta)$ and $h(x)$ such that
$f(x^n;\theta) = g(t(x^n), \theta) \, h(x^n)$.

9.41 Example. Return to the two coin flips. Let $t = x_1 + x_2$. Then

$$\begin{aligned}
f(x_1, x_2; \theta) &= f(x_1;\theta) f(x_2;\theta) \\
&= \theta^{x_1} (1 - \theta)^{1 - x_1} \theta^{x_2} (1 - \theta)^{1 - x_2} \\
&= g(t, \theta) \, h(x_1, x_2)
\end{aligned}$$

where $g(t, \theta) = \theta^t (1 - \theta)^{2 - t}$ and $h(x_1, x_2) = 1$.
Therefore, $T = X_1 + X_2$ is sufficient. ■
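The conditional distributions listed for the two coin flips can be checked by enumeration. The sketch below (function name ours) computes $P(X_1 = x_1, X_2 = x_2 \mid T = t)$ from the joint probabilities for any value of the parameter, so one can confirm numerically that the answer does not change with $p$.

```python
from itertools import product

def conditional_given_sum(p, t):
    """P(X1 = x1, X2 = x2 | X1 + X2 = t) for two independent Bernoulli(p) flips."""
    joint = {x: (p ** sum(x)) * ((1.0 - p) ** (2 - sum(x)))
             for x in product((0, 1), repeat=2)}
    total = sum(pr for x, pr in joint.items() if sum(x) == t)
    return {x: (pr / total if sum(x) == t else 0.0) for x, pr in joint.items()}

for p in (0.2, 0.7):
    print(p, conditional_given_sum(p, 1))
```

The same enumeration applied to $V = X_1$ in place of $T$ would give conditional probabilities that do depend on $p$, which is another way to see that $V$ is not sufficient.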

Now we discuss an implication of sufficiency in point estimation. Let
$\hat\theta$ be an estimator of $\theta$. The Rao-Blackwell theorem says that an
estimator should only depend on the sufficient statistic, otherwise it can be
improved. Let $R(\theta, \hat\theta) = E_\theta(\theta - \hat\theta)^2$ denote the
MSE of the estimator.

9.42 Theorem (Rao-Blackwell). Let $\hat\theta$ be an estimator and let $T$ be a
sufficient statistic. Define a new estimator by

$$\tilde\theta = E(\hat\theta \mid T).$$

Then, for every $\theta$, $R(\theta, \tilde\theta) \le R(\theta, \hat\theta)$.

9.43 Example. Consider flipping a coin twice. Let $\hat\theta = X_1$. This is a
well-defined (and unbiased) estimator. But it is not a function of the sufficient
statistic $T = X_1 + X_2$. However, note that
$\tilde\theta = E(X_1 \mid T) = (X_1 + X_2)/2$. By the Rao-Blackwell Theorem,
$\tilde\theta$ has MSE at least as small as $\hat\theta = X_1$. The same applies
with $n$ coin flips. Again define $\hat\theta = X_1$ and $T = \sum_i X_i$. Then
$\tilde\theta = E(X_1 \mid T) = n^{-1} \sum_i X_i$ has improved MSE. ■
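For coin flips the improvement promised by the Rao-Blackwell theorem can be computed exactly by summing over all $2^n$ outcomes. The sketch below is an illustration of our own (the helper name and the choice $n = 5$, $p = 0.3$ are arbitrary); it compares the MSE of $\hat\theta = X_1$ with that of the sample mean.

```python
from itertools import product

def exact_mse(estimator, n, p):
    """Exact MSE of an estimator of p under n independent Bernoulli(p) flips."""
    mse = 0.0
    for xs in product((0, 1), repeat=n):
        prob = 1.0
        for x in xs:
            prob *= p if x == 1 else 1.0 - p
        mse += prob * (estimator(xs) - p) ** 2
    return mse

def first_flip(xs):
    return xs[0]

def sample_mean(xs):
    return sum(xs) / len(xs)

n, p = 5, 0.3
print(exact_mse(first_flip, n, p))    # p (1 - p)
print(exact_mse(sample_mean, n, p))   # p (1 - p) / n
```

Both estimators are unbiased, so each MSE is just a variance, and conditioning on the sufficient statistic divides it by $n$.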

9.13.3 Exponential Families 

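Most of the parametric models we have studied so far are special cases of
a general class of models called exponential families. We say that
$\{f(x;\theta) : \theta \in \Theta\}$ is a one-parameter exponential family if
there are functions $\eta(\theta)$, $B(\theta)$, $T(x)$ and $h(x)$ such that

$$f(x;\theta) = h(x) \, e^{\eta(\theta) T(x) - B(\theta)}.$$

It is easy to see that $T(X)$ is sufficient. We call $T$ the natural sufficient
statistic.

9.44 Example. Let $X \sim \text{Poisson}(\theta)$. Then

$$f(x;\theta) = \frac{\theta^x e^{-\theta}}{x!} = \frac{1}{x!} e^{x \log \theta - \theta}$$

and hence, this is an exponential family with $\eta(\theta) = \log \theta$,
$B(\theta) = \theta$, $T(x) = x$, $h(x) = 1/x!$. ■

9.45 Example. Let $X \sim \text{Binomial}(n, \theta)$. Then

$$f(x;\theta) = \binom{n}{x} \theta^x (1 - \theta)^{n - x} = \binom{n}{x} \exp\left\{ x \log\left( \frac{\theta}{1 - \theta} \right) + n \log(1 - \theta) \right\}.$$

In this case,

$$\eta(\theta) = \log\left( \frac{\theta}{1 - \theta} \right), \quad B(\theta) = -n \log(1 - \theta)$$

and

$$T(x) = x, \quad h(x) = \binom{n}{x}. \quad ■$$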

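We can rewrite an exponential family as

$$f(x;\eta) = h(x) \, e^{\eta T(x) - A(\eta)}$$

where $\eta = \eta(\theta)$ is called the natural parameter and

$$A(\eta) = \log \int h(x) \, e^{\eta T(x)} \, dx.$$

For example a Poisson can be written as $f(x;\eta) = e^{\eta x - e^{\eta}}/x!$
where the natural parameter is $\eta = \log \theta$.

Let $X_1, \ldots, X_n$ be IID from an exponential family. Then $f(x^n;\theta)$ is
an exponential family:

$$f(x^n;\theta) = h_n(x^n) \, e^{\eta(\theta) T_n(x^n) - B_n(\theta)}$$

where $h_n(x^n) = \prod_i h(x_i)$, $T_n(x^n) = \sum_i T(x_i)$ and
$B_n(\theta) = n B(\theta)$. This implies that $\sum_i T(X_i)$ is sufficient.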

9.46 Example. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$. Then

$$f(x^n;\theta) = \frac{1}{\theta^n} I(x_{(n)} \le \theta)$$

where $I$ is 1 if the term inside the brackets is true and 0 otherwise, and
$x_{(n)} = \max\{x_1, \ldots, x_n\}$. Thus $T(X^n) = \max\{X_1, \ldots, X_n\}$ is
sufficient. But since $T(X^n) \ne \sum_i T(X_i)$, this cannot be an exponential
family. ■

9.47 Theorem. Let $X$ have density in an exponential family. Then,

$$E(T(X)) = A'(\eta), \quad V(T(X)) = A''(\eta).$$

If $\theta = (\theta_1, \ldots, \theta_k)$ is a vector, then we say that
$f(x;\theta)$ has exponential family form if

$$f(x;\theta) = h(x) \exp\left\{ \sum_{j=1}^k \eta_j(\theta) T_j(x) - B(\theta) \right\}.$$

Again, $T = (T_1, \ldots, T_k)$ is sufficient. An IID sample of size $n$ also has
exponential form with sufficient statistic
$(\sum_i T_1(X_i), \ldots, \sum_i T_k(X_i))$.

9.48 Example. Consider the normal family with $\theta = (\mu, \sigma)$. Now,

$$f(x;\theta) = \exp\left\{ \frac{\mu}{\sigma^2} x - \frac{x^2}{2\sigma^2} - \frac{1}{2} \left( \frac{\mu^2}{\sigma^2} + \log(2\pi\sigma^2) \right) \right\}.$$

This is exponential with

$$\begin{aligned}
\eta_1(\theta) &= \frac{\mu}{\sigma^2}, & T_1(x) &= x \\
\eta_2(\theta) &= -\frac{1}{2\sigma^2}, & T_2(x) &= x^2 \\
B(\theta) &= \frac{1}{2} \left( \frac{\mu^2}{\sigma^2} + \log(2\pi\sigma^2) \right), & h(x) &= 1.
\end{aligned}$$

Hence, with $n$ IID samples, $(\sum_i X_i, \sum_i X_i^2)$ is sufficient. ■

As before we can write an exponential family as

$$f(x;\eta) = h(x) \exp\left\{ T^T(x) \eta - A(\eta) \right\},$$

where $A(\eta) = \log \int h(x) e^{T^T(x) \eta} \, dx$. It can be shown that

$$E(T(X)) = \dot A(\eta), \quad V(T(X)) = \ddot A(\eta),$$

where the first expression is the vector of partial derivatives and the second
is the matrix of second derivatives.
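The identities $E(T(X)) = A'(\eta)$ and $V(T(X)) = A''(\eta)$ can be checked numerically in the one-parameter Poisson case, where $T(x) = x$, $h(x) = 1/x!$, and both the mean and the variance equal $\theta = e^{\eta}$. The sketch below is our own illustration: the integral defining $A(\eta)$ becomes a sum because the Poisson is discrete, the sum is truncated, and the derivatives are approximated by finite differences.

```python
import math

def log_partition_poisson(eta, terms=200):
    """A(eta) = log sum_x h(x) exp(eta * x) with h(x) = 1/x!, truncated after `terms` terms."""
    total, term = 0.0, 1.0           # term holds exp(eta * x) / x!, starting at x = 0
    for x in range(terms):
        total += term
        term *= math.exp(eta) / (x + 1)
    return math.log(total)

def central_diff(f, x, h=1e-4):
    """Finite-difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def second_diff(f, x, h=1e-3):
    """Finite-difference approximation to f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

theta = 3.0
eta = math.log(theta)
mean_from_A = central_diff(log_partition_poisson, eta)
var_from_A = second_diff(log_partition_poisson, eta)
print(mean_from_A, var_from_A)   # both should be close to theta
```

Here $A(\eta) = e^{\eta}$ in closed form, so the numerical derivatives can also be compared against $e^{\eta}$ directly.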



9.13.4 Computing Maximum Likelihood Estimates 

In some cases we can find the MLE $\hat\theta$ analytically. More often, we need
to find the MLE by numerical methods. We will briefly discuss two commonly
used methods: (i) Newton-Raphson, and (ii) the EM algorithm. Both are
iterative methods that produce a sequence of values $\theta^0, \theta^1, \ldots$
that, under ideal conditions, converge to the MLE $\hat\theta$. In each case, it
is helpful to use a good starting value $\theta^0$. Often, the method of moments
estimator is a good starting value.

Newton-Raphson. To motivate Newton-Raphson, let's expand the derivative of the
log-likelihood around $\theta^j$:

$$0 = \ell'(\hat\theta) \approx \ell'(\theta^j) + (\hat\theta - \theta^j) \ell''(\theta^j).$$

Solving for $\hat\theta$ gives

$$\hat\theta \approx \theta^j - \frac{\ell'(\theta^j)}{\ell''(\theta^j)}.$$

This suggests the following iterative scheme:

$$\theta^{j+1} = \theta^j - \frac{\ell'(\theta^j)}{\ell''(\theta^j)}.$$

In the multiparameter case, the MLE $\hat\theta = (\hat\theta_1, \ldots, \hat\theta_k)$
is a vector and the method becomes

$$\theta^{j+1} = \theta^j - H^{-1} \ell'(\theta^j)$$

where $\ell'(\theta^j)$ is the vector of first derivatives and $H$ is the matrix
of second derivatives of the log-likelihood.
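The one-parameter Newton-Raphson scheme takes only a few lines of code. The sketch below is a minimal illustration under our own assumptions: the function names, the stopping rule, and the test model (an Exponential sample with rate $\lambda$, for which $\ell'(\lambda) = n/\lambda - \sum_i X_i$ and $\ell''(\lambda) = -n/\lambda^2$) are not taken from the text. The Exponential model is convenient because the MLE $n/\sum_i X_i$ is known in closed form, so the iteration can be checked.

```python
def newton_raphson(score, score_derivative, theta0, tol=1e-10, max_iter=100):
    """Iterate theta <- theta - l'(theta) / l''(theta) until the step is tiny."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta) / score_derivative(theta)
        theta = theta - step
        if abs(step) < tol:
            break
    return theta

# Toy data and the Exponential(rate) log-likelihood derivatives.
xs = [0.8, 1.9, 0.4, 2.7, 1.2]
n, s = len(xs), sum(xs)

def score(lam):
    return n / lam - s            # l'(lambda)

def score_derivative(lam):
    return -n / lam ** 2          # l''(lambda)

lam_hat = newton_raphson(score, score_derivative, theta0=0.5)
print(lam_hat, n / s)
```

The starting value matters: for this model the iteration only converges when $\theta^0$ lies between 0 and twice the MLE, which is one reason a method of moments estimate is a sensible place to start.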

The EM Algorithm. The letters EM stand for Expectation-Maximization.
The idea is to iterate between taking an expectation then maximizing. Suppose
we have data $Y$ whose density $f(y;\theta)$ leads to a log-likelihood that is
hard to maximize. But suppose we can find another random variable $Z$ such
that $f(y;\theta) = \int f(y, z;\theta) \, dz$ and such that the likelihood based
on $f(y, z;\theta)$ is easy to maximize. In other words, the model of interest is
the marginal of a model with a simpler likelihood. In this case, we call $Y$ the
observed data and $Z$ the hidden (or latent or missing) data. If we could just
"fill in" the missing data, we would have an easy problem. Conceptually, the EM
algorithm works by filling in the missing data, maximizing the log-likelihood,
and iterating.



9.49 Example (Mixture of Normals). Sometimes it is reasonable to assume that
the distribution of the data is a mixture of two normals. Think of heights of
people being a mixture of men’s and women’s heights. Let $\phi(y; \mu, \sigma)$ denote
a normal density with mean $\mu$ and standard deviation $\sigma$. The density of a
mixture of two Normals is

$$f(y; \theta) = (1 - p)\,\phi(y; \mu_0, \sigma_0) + p\,\phi(y; \mu_1, \sigma_1).$$

The idea is that an observation is drawn from the first normal with probability
$1 - p$ and the second with probability $p$. However, we don’t know which Normal
it was drawn from. The parameters are $\theta = (\mu_0, \sigma_0, \mu_1, \sigma_1, p)$. The likelihood
function is

$$\mathcal{L}(\theta) = \prod_{i=1}^n \left[ (1 - p)\,\phi(y_i; \mu_0, \sigma_0) + p\,\phi(y_i; \mu_1, \sigma_1) \right].$$

Maximizing this function over the five parameters is hard. Imagine that we
were given extra information telling us which of the two normals every
observation came from. These “complete” data are of the form $(Y_1, Z_1), \ldots, (Y_n, Z_n)$,
where $Z_i = 0$ represents the first normal and $Z_i = 1$ represents the second.
Note that $\mathbb{P}(Z_i = 1) = p$. We shall soon see that the likelihood for the
complete data $(Y_1, Z_1), \ldots, (Y_n, Z_n)$ is much simpler than the likelihood for the
observed data $Y_1, \ldots, Y_n$. ■

Now we describe the EM algorithm. 



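The EM Algorithm

(0) Pick a starting value $\theta^0$. Now for $j = 1, 2, \ldots,$ repeat steps 1 and 2
below:

(1) (The E-step): Calculate

$$J(\theta|\theta^j) = \mathbb{E}_{\theta^j}\left( \log \frac{f(Y^n, Z^n; \theta)}{f(Y^n, Z^n; \theta^j)} \,\Big|\, Y^n = y^n \right).$$

The expectation is over the missing data $Z^n$ treating $\theta^j$ and the observed
data $Y^n$ as fixed.

(2) Find $\theta^{j+1}$ to maximize $J(\theta|\theta^j)$.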



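We now show that the EM algorithm always increases the likelihood, that
is, $\mathcal{L}(\theta^{j+1}) \ge \mathcal{L}(\theta^j)$. Note that

$$J(\theta^{j+1}|\theta^j) = \mathbb{E}_{\theta^j}\left( \log \frac{f(Y^n, Z^n; \theta^{j+1})}{f(Y^n, Z^n; \theta^j)} \,\Big|\, Y^n = y^n \right)$$

$$= \log \frac{f(y^n; \theta^{j+1})}{f(y^n; \theta^j)} + \mathbb{E}_{\theta^j}\left( \log \frac{f(Z^n|Y^n; \theta^{j+1})}{f(Z^n|Y^n; \theta^j)} \,\Big|\, Y^n = y^n \right)$$

and hence

$$\log \frac{\mathcal{L}(\theta^{j+1})}{\mathcal{L}(\theta^j)} = \log \frac{f(y^n; \theta^{j+1})}{f(y^n; \theta^j)}$$

$$= J(\theta^{j+1}|\theta^j) - \mathbb{E}_{\theta^j}\left( \log \frac{f(Z^n|Y^n; \theta^{j+1})}{f(Z^n|Y^n; \theta^j)} \,\Big|\, Y^n = y^n \right)$$

$$= J(\theta^{j+1}|\theta^j) + K(f_j, f_{j+1})$$

where $f_j = f(z^n|y^n; \theta^j)$ and $f_{j+1} = f(z^n|y^n; \theta^{j+1})$ are the conditional densities of
the missing data given the observed data, and $K(f, g) = \int f(x) \log(f(x)/g(x))\,dx$
is the Kullback-Leibler distance. Now, $\theta^{j+1}$ was chosen to maximize $J(\theta|\theta^j)$.
Hence, $J(\theta^{j+1}|\theta^j) \ge J(\theta^j|\theta^j) = 0$. Also, by the properties of Kullback-Leibler
divergence, $K(f_j, f_{j+1}) \ge 0$. Hence, $\mathcal{L}(\theta^{j+1}) \ge \mathcal{L}(\theta^j)$ as claimed.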




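9.50 Example (Continuation of Example 9.49). Consider again the mixture of
two normals but, for simplicity, assume that $p = 1/2$, $\sigma_0 = \sigma_1 = 1$. The density
is

$$f(y; \mu_0, \mu_1) = \frac{1}{2}\phi(y; \mu_0, 1) + \frac{1}{2}\phi(y; \mu_1, 1).$$

Directly maximizing the likelihood is hard. Introduce latent variables $Z_1, \ldots, Z_n$
where $Z_i = 0$ if $Y_i$ is from $\phi(y; \mu_0, 1)$, and $Z_i = 1$ if $Y_i$ is from $\phi(y; \mu_1, 1)$,
$\mathbb{P}(Z_i = 1) = \mathbb{P}(Z_i = 0) = 1/2$, $f(y_i|Z_i = 0) = \phi(y_i; \mu_0, 1)$ and $f(y_i|Z_i = 1) =
\phi(y_i; \mu_1, 1)$. So $f(y) = \sum_{z=0}^{1} f(y, z)$ where we have dropped the parameters
from the density to avoid notational overload. We can write

$$f(z, y) = f(z) f(y|z) = \frac{1}{2}\phi(y; \mu_0, 1)^{1-z}\,\phi(y; \mu_1, 1)^{z}.$$

Hence, up to the constant factor $2^{-n}$, the complete likelihood is

$$\prod_{i=1}^n \phi(y_i; \mu_0, 1)^{1-z_i}\,\phi(y_i; \mu_1, 1)^{z_i}.$$

The complete log-likelihood is then, up to an additive constant,

$$\tilde\ell = -\frac{1}{2}\sum_{i=1}^n (1 - z_i)(y_i - \mu_0)^2 - \frac{1}{2}\sum_{i=1}^n z_i (y_i - \mu_1)^2.$$

And so, again up to an additive constant that does not depend on $\theta$,

$$J(\theta|\theta^j) = -\frac{1}{2}\sum_{i=1}^n \left(1 - \mathbb{E}(Z_i|y^n, \theta^j)\right)(y_i - \mu_0)^2 - \frac{1}{2}\sum_{i=1}^n \mathbb{E}(Z_i|y^n, \theta^j)(y_i - \mu_1)^2.$$

Since $Z_i$ is binary, $\mathbb{E}(Z_i|y^n, \theta^j) = \mathbb{P}(Z_i = 1|y^n, \theta^j)$ and, by Bayes’ theorem,

$$\mathbb{P}(Z_i = 1|y^n, \theta^j) = \frac{f(y_i|Z_i = 1; \theta^j)\,\mathbb{P}(Z_i = 1)}{f(y_i|Z_i = 1; \theta^j)\,\mathbb{P}(Z_i = 1) + f(y_i|Z_i = 0; \theta^j)\,\mathbb{P}(Z_i = 0)}$$

$$= \frac{\phi(y_i; \mu_1^j, 1)\frac{1}{2}}{\phi(y_i; \mu_1^j, 1)\frac{1}{2} + \phi(y_i; \mu_0^j, 1)\frac{1}{2}}$$

$$= \frac{\phi(y_i; \mu_1^j, 1)}{\phi(y_i; \mu_1^j, 1) + \phi(y_i; \mu_0^j, 1)}$$

$$= \tau(i).$$

Take the derivative of $J(\theta|\theta^j)$ with respect to $\mu_0$ and $\mu_1$, set them equal to 0
to get

$$\hat\mu_1^{j+1} = \frac{\sum_{i=1}^n \tau_i y_i}{\sum_{i=1}^n \tau_i}$$

and

$$\hat\mu_0^{j+1} = \frac{\sum_{i=1}^n (1 - \tau_i) y_i}{\sum_{i=1}^n (1 - \tau_i)}.$$

We then recompute $\tau_i$ using $\hat\mu_1^{j+1}$ and $\hat\mu_0^{j+1}$ and iterate. ■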
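
The E-step and M-step of this example can be sketched as follows. This is an illustrative implementation under the example's assumptions ($p = 1/2$, both standard deviations equal to 1). The simulated data, the starting values (the sample minimum and maximum), and the stopping rule are choices made for the sketch rather than anything prescribed by the text.

```python
import math
import random


def phi(y, mu):
    """Normal density with mean mu and standard deviation 1."""
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2.0 * math.pi)


def em_two_normals(y, mu0, mu1, max_iter=500, tol=1e-10):
    """EM for the mixture 0.5*N(mu0, 1) + 0.5*N(mu1, 1)."""
    for _ in range(max_iter):
        # E-step: tau_i = P(Z_i = 1 | y_i) under the current means.
        tau = [phi(yi, mu1) / (phi(yi, mu1) + phi(yi, mu0)) for yi in y]
        # M-step: weighted means, with weights tau_i and 1 - tau_i.
        new_mu1 = sum(t * yi for t, yi in zip(tau, y)) / sum(tau)
        new_mu0 = (sum((1 - t) * yi for t, yi in zip(tau, y))
                   / sum(1 - t for t in tau))
        change = abs(new_mu0 - mu0) + abs(new_mu1 - mu1)
        mu0, mu1 = new_mu0, new_mu1
        if change < tol:
            break
    return mu0, mu1


# Simulated data (illustrative values): true means -2 and 3.
rng = random.Random(0)
y = [rng.gauss(-2.0, 1.0) if rng.random() < 0.5 else rng.gauss(3.0, 1.0)
     for _ in range(400)]
mu0_hat, mu1_hat = em_two_normals(y, mu0=min(y), mu1=max(y))
```

EM only guarantees that the likelihood does not decrease, so different starting values can lead to different local maxima. With well-separated components, as in this simulated data, the estimates typically land near the true means.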



9.14 Exercises 

1. Let $X_1, \ldots, X_n \sim \text{Gamma}(\alpha, \beta)$. Find the method of moments estimator
for $\alpha$ and $\beta$.

2. Let $X_1, \ldots, X_n \sim \text{Uniform}(a, b)$ where $a$ and $b$ are unknown parameters
and $a < b$.

(a) Find the method of moments estimators for $a$ and $b$.

(b) Find the mle $\hat{a}$ and $\hat{b}$.

(c) Let $\tau = \int x\,dF(x)$. Find the mle of $\tau$.

(d) Let $\hat\tau$ be the mle of $\tau$. Let $\tilde\tau$ be the nonparametric plug-in estimator
of $\tau = \int x\,dF(x)$. Suppose that $a = 1$, $b = 3$, and $n = 10$. Find the MSE
of $\hat\tau$ by simulation. Find the MSE of $\tilde\tau$ analytically. Compare.

3. Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$. Let $\tau$ be the .95 percentile, i.e. $\mathbb{P}(X < \tau) = .95$.

(a) Find the mle of $\tau$.

(b) Find an expression for an approximate $1 - \alpha$ confidence interval for $\tau$.

(c) Suppose the data are:

 3.23  -2.50   1.88  -0.68   4.43   0.17
 1.03  -0.07  -0.01   0.76   1.76   3.18
 0.33  -0.31   0.30  -0.61   1.52   5.43
 1.54   2.28   0.42   2.33  -1.03   4.00
 0.39

Find the mle $\hat\tau$. Find the standard error using the delta method. Find
the standard error using the parametric bootstrap.

4. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$. Show that the mle is consistent. Hint:
Let $Y = \max\{X_1, \ldots, X_n\}$. For any $c$, $\mathbb{P}(Y < c) = \mathbb{P}(X_1 < c, X_2 <
c, \ldots, X_n < c) = \mathbb{P}(X_1 < c)\mathbb{P}(X_2 < c) \cdots \mathbb{P}(X_n < c)$.

5. Let $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$. Find the method of moments estimator,
the maximum likelihood estimator and the Fisher information $I(\lambda)$.

6. Let $X_1, \ldots, X_n \sim N(\theta, 1)$. Define

$$Y_i = \begin{cases} 1 & \text{if } X_i > 0 \\ 0 & \text{if } X_i \le 0. \end{cases}$$

Let $\psi = \mathbb{P}(Y_1 = 1)$.

(a) Find the maximum likelihood estimator $\hat\psi$ of $\psi$.

(b) Find an approximate 95 percent confidence interval for $\psi$.

(c) Define $\tilde\psi = (1/n)\sum_i Y_i$. Show that $\tilde\psi$ is a consistent estimator of $\psi$.

(d) Compute the asymptotic relative efficiency of $\tilde\psi$ to $\hat\psi$. Hint: Use the
delta method to get the standard error of the mle. Then compute the
standard error (i.e. the standard deviation) of $\tilde\psi$.

(e) Suppose that the data are not really normal. Show that $\hat\psi$ is not
consistent. What, if anything, does $\hat\psi$ converge to?

7. (Comparing two treatments.) $n_1$ people are given treatment 1 and $n_2$
people are given treatment 2. Let $X_1$ be the number of people on
treatment 1 who respond favorably to the treatment and let $X_2$ be the
number of people on treatment 2 who respond favorably. Assume that
$X_1 \sim \text{Binomial}(n_1, p_1)$ and $X_2 \sim \text{Binomial}(n_2, p_2)$. Let $\psi = p_1 - p_2$.

(a) Find the mle $\hat\psi$ for $\psi$.

(b) Find the Fisher information matrix $I(p_1, p_2)$.

(c) Use the multiparameter delta method to find the asymptotic
standard error of $\hat\psi$.

(d) Suppose that $n_1 = n_2 = 200$, $X_1 = 160$ and $X_2 = 148$. Find $\hat\psi$. Find
an approximate 90 percent confidence interval for $\psi$ using (i) the delta
method and (ii) the parametric bootstrap.

8. Find the Fisher information matrix for Example 9.29.

9. Let $X_1, \ldots, X_n \sim \text{Normal}(\mu, 1)$. Let $\theta = e^{\mu}$ and let $\hat\theta = e^{\bar{X}}$ be the mle.
Create a data set (using $\mu = 5$) consisting of $n = 100$ observations.

(a) Use the delta method to get $\widehat{\text{se}}$ and a 95 percent confidence interval
for $\theta$. Use the parametric bootstrap to get $\widehat{\text{se}}$ and a 95 percent confidence
interval for $\theta$. Use the nonparametric bootstrap to get $\widehat{\text{se}}$ and a 95 percent
confidence interval for $\theta$. Compare your answers.

(b) Plot a histogram of the bootstrap replications for the parametric
and nonparametric bootstraps. These are estimates of the distribution
of $\hat\theta$. The delta method also gives an approximation to this distribution,
namely, $\text{Normal}(\hat\theta, \widehat{\text{se}}^2)$. Compare these to the true sampling
distribution of $\hat\theta$ (which you can get by simulation). Which approximation
(parametric bootstrap, bootstrap, or delta method) is closer to the true
distribution?

10. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$. The mle is $\hat\theta = X_{(n)} = \max\{X_1, \ldots, X_n\}$.
Generate a dataset of size 50 with $\theta = 1$.

(a) Find the distribution of $\hat\theta$ analytically. Compare the true
distribution of $\hat\theta$ to the histograms from the parametric and nonparametric
bootstraps.

(b) This is a case where the nonparametric bootstrap does very poorly.
Show that for the parametric bootstrap $\mathbb{P}(\hat\theta^* = \hat\theta) = 0$, but for the
nonparametric bootstrap $\mathbb{P}(\hat\theta^* = \hat\theta) \approx .632$. Hint: show that $\mathbb{P}(\hat\theta^* =
\hat\theta) = 1 - (1 - (1/n))^n$, then take the limit as $n$ gets large. What is the
implication of this?




10 

Hypothesis Testing and p-values



Suppose we want to know if exposure to asbestos is associated with lung 
disease. We take some rats and randomly divide them into two groups. We 
expose one group to asbestos and leave the second group unexposed. Then 
we compare the disease rate in the two groups. Consider the following two 
hypotheses: 

The Null Hypothesis: The disease rate is the same in the two groups. 

The Alternative Hypothesis: The disease rate is not the same in the two 
groups. 

If the exposed group has a much higher rate of disease than the unexposed 
group then we will reject the null hypothesis and conclude that the evidence 
favors the alternative hypothesis. This is an example of hypothesis testing. 

More formally, suppose that we partition the parameter space $\Theta$ into two
disjoint sets $\Theta_0$ and $\Theta_1$ and that we wish to test

$$H_0 : \theta \in \Theta_0 \quad \text{versus} \quad H_1 : \theta \in \Theta_1. \qquad (10.1)$$

We call $H_0$ the null hypothesis and $H_1$ the alternative hypothesis.

Let $X$ be a random variable and let $\mathcal{X}$ be the range of $X$. We test a
hypothesis by finding an appropriate subset of outcomes $R \subset \mathcal{X}$ called the rejection
region. If $X \in R$ we reject the null hypothesis, otherwise, we do not reject
the null hypothesis:

$X \in R \implies$ reject $H_0$

$X \notin R \implies$ retain (do not reject) $H_0$

              Retain Null       Reject Null
$H_0$ true    ✓                 type I error
$H_1$ true    type II error     ✓

TABLE 10.1. Summary of outcomes of hypothesis testing.

Usually, the rejection region R is of the form 

$$R = \{ x : T(x) > c \} \qquad (10.2)$$

where T is a test statistic and c is a critical value. The problem in hy- 
pothesis testing is to find an appropriate test statistic T and an appropriate 
critical value c. 

Warning! There is a tendency to use hypothesis testing methods even 
when they are not appropriate. Often, estimation and confidence intervals are 
better tools. Use hypothesis testing only when you want to test a well-defined 
hypothesis. 

Hypothesis testing is like a legal trial. We assume someone is innocent
unless the evidence strongly suggests that he is guilty. Similarly, we retain $H_0$
unless there is strong evidence to reject $H_0$. There are two types of errors we
can make. Rejecting $H_0$ when $H_0$ is true is called a type I error. Retaining
$H_0$ when $H_1$ is true is called a type II error. The possible outcomes for
hypothesis testing are summarized in Tab. 10.1.

10.1 Definition. The power function of a test with rejection region $R$ is
defined by

$$\beta(\theta) = \mathbb{P}_\theta(X \in R). \qquad (10.3)$$

The size of a test is defined to be

$$\alpha = \sup_{\theta \in \Theta_0} \beta(\theta). \qquad (10.4)$$

A test is said to have level $\alpha$ if its size is less than or equal to $\alpha$.
A hypothesis of the form $\theta = \theta_0$ is called a simple hypothesis. A
hypothesis of the form $\theta > \theta_0$ or $\theta < \theta_0$ is called a composite hypothesis. A test
of the form

$$H_0 : \theta = \theta_0 \quad \text{versus} \quad H_1 : \theta \ne \theta_0$$

is called a two-sided test. A test of the form

$$H_0 : \theta \le \theta_0 \quad \text{versus} \quad H_1 : \theta > \theta_0$$

or

$$H_0 : \theta \ge \theta_0 \quad \text{versus} \quad H_1 : \theta < \theta_0$$

is called a one-sided test. The most common tests are two-sided.

10.2 Example. Let $X_1, \ldots, X_n \sim N(\mu, \sigma)$ where $\sigma$ is known. We want to test
$H_0 : \mu \le 0$ versus $H_1 : \mu > 0$. Hence, $\Theta_0 = (-\infty, 0]$ and $\Theta_1 = (0, \infty)$.
Consider the test:

reject $H_0$ if $T > c$

where $T = \bar{X}$. The rejection region is

$$R = \{ (x_1, \ldots, x_n) : T(x_1, \ldots, x_n) > c \}.$$

FIGURE 10.1. The power function for Example 10.2. The size of the test is the
largest probability of rejecting $H_0$ when $H_0$ is true. This occurs at $\mu = 0$ hence the
size is $\beta(0)$. We choose the critical value $c$ so that $\beta(0) = \alpha$.
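
The power function described in the caption can be evaluated numerically. Since $\bar{X} \sim N(\mu, \sigma^2/n)$ when $\sigma$ is the known standard deviation, $\beta(\mu) = \mathbb{P}_\mu(\bar{X} > c) = 1 - \Phi(\sqrt{n}(c - \mu)/\sigma)$, and setting $\beta(0) = \alpha$ gives $c = \sigma\,\Phi^{-1}(1 - \alpha)/\sqrt{n}$. The sketch below codes these two formulas. The function names and the values of `sigma`, `n` and `alpha` are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def power(mu, c, sigma, n):
    """beta(mu) = P_mu(Xbar > c) when Xbar ~ N(mu, sigma^2 / n)."""
    return 1.0 - Z.cdf(sqrt(n) * (c - mu) / sigma)


def critical_value(alpha, sigma, n):
    """Choose c so that the size beta(0) equals alpha."""
    return sigma * Z.inv_cdf(1.0 - alpha) / sqrt(n)


# Illustrative values (assumptions): sigma = 1, n = 25, alpha = 0.05.
sigma, n, alpha = 1.0, 25, 0.05
c = critical_value(alpha, sigma, n)
size = power(0.0, c, sigma, n)
powers = [power(mu, c, sigma, n) for mu in (-0.5, 0.0, 0.5, 1.0)]
```

Because the power function is increasing in $\mu$, its supremum over $\Theta_0 = (-\infty, 0]$ is attained at $\mu = 0$, which is why `size` is evaluated there.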



It would be desirable to find the test with highest power under $H_1$, among
all size $\alpha$ tests. Such a test, if it exists, is called most powerful. Finding
most powerful tests is hard and, in many cases, most powerful tests don’t
even exist. Instead of going into detail about when most powerful tests exist,
we’ll just consider four widely used tests: the Wald test,¹ the $\chi^2$ test, the
permutation test, and the likelihood ratio test.



10.1 The Wald Test 

Let $\theta$ be a scalar parameter, let $\hat\theta$ be an estimate of $\theta$ and let $\widehat{\text{se}}$ be the
estimated standard error of $\hat\theta$.

¹ The test is named after Abraham Wald (1902-1950), who was a very influential
mathematical statistician. Wald died in a plane crash in India in 1950.

10.3 Definition. The Wald Test

Consider testing

$$H_0 : \theta = \theta_0 \quad \text{versus} \quad H_1 : \theta \ne \theta_0.$$

Assume that $\hat\theta$ is asymptotically Normal:

$$\frac{\hat\theta - \theta_0}{\widehat{\text{se}}} \rightsquigarrow N(0, 1).$$

The size $\alpha$ Wald test is: reject $H_0$ when $|W| > z_{\alpha/2}$ where

$$W = \frac{\hat\theta - \theta_0}{\widehat{\text{se}}}. \qquad (10.5)$$



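10.4 Theorem. Asymptotically, the Wald test has size $\alpha$, that is,

$$\mathbb{P}_{\theta_0}\left( |W| > z_{\alpha/2} \right) \to \alpha$$

as $n \to \infty$.

Proof. Under $\theta = \theta_0$, $(\hat\theta - \theta_0)/\widehat{\text{se}} \rightsquigarrow N(0, 1)$. Hence, the probability of
rejecting when the null $\theta = \theta_0$ is true is

$$\mathbb{P}_{\theta_0}\left( |W| > z_{\alpha/2} \right) = \mathbb{P}_{\theta_0}\left( \frac{|\hat\theta - \theta_0|}{\widehat{\text{se}}} > z_{\alpha/2} \right) \to \mathbb{P}\left( |Z| > z_{\alpha/2} \right) = \alpha$$

where $Z \sim N(0, 1)$. ■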

10.5 Remark. An alternative version of the Wald test statistic is $W = (\hat\theta -
\theta_0)/\text{se}_0$ where $\text{se}_0$ is the standard error computed at $\theta = \theta_0$. Both versions of
the test are valid.



Let us consider the power of the Wald test when the null hypothesis is false. 



10.6 Theorem. Suppose the true value of $\theta$ is $\theta_\star \ne \theta_0$. The power $\beta(\theta_\star)$, the
probability of correctly rejecting the null hypothesis, is given (approximately)
by

$$1 - \Phi\left( \frac{\theta_0 - \theta_\star}{\widehat{\text{se}}} + z_{\alpha/2} \right) + \Phi\left( \frac{\theta_0 - \theta_\star}{\widehat{\text{se}}} - z_{\alpha/2} \right). \qquad (10.6)$$

Recall that $\widehat{\text{se}}$ tends to 0 as the sample size increases. Inspecting (10.6)
closely we note that: (i) the power is large if $\theta_\star$ is far from $\theta_0$ and (ii) the
power is large if the sample size is large.
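
Formula (10.6) is easy to evaluate numerically, which makes both observations concrete. The following is a small sketch; the function name `wald_power` and the numbers passed to it are illustrative assumptions.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def wald_power(theta_star, theta0, se, alpha=0.05):
    """Approximate power of the size-alpha Wald test, following (10.6)."""
    z = Z.inv_cdf(1.0 - alpha / 2.0)       # z_{alpha/2}
    shift = (theta0 - theta_star) / se
    return 1.0 - Z.cdf(shift + z) + Z.cdf(shift - z)


# Illustrative values: the power grows as the standard error shrinks
# (larger samples) and as theta_star moves away from theta0.
power_small_n = wald_power(theta_star=0.5, theta0=0.0, se=0.5)
power_large_n = wald_power(theta_star=0.5, theta0=0.0, se=0.1)
power_at_null = wald_power(theta_star=0.0, theta0=0.0, se=0.1)
```

At $\theta_\star = \theta_0$ the expression reduces to $1 - \Phi(z_{\alpha/2}) + \Phi(-z_{\alpha/2}) = \alpha$, which agrees with Theorem 10.4.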

10.7 Example (Comparing Two Prediction Algorithms). We test a prediction
algorithm on a test set of size $m$ and we test a second prediction algorithm on
a second test set of size $n$. Let $X$ be the number of incorrect predictions for
algorithm 1 and let $Y$ be the number of incorrect predictions for algorithm
2. Then $X \sim \text{Binomial}(m, p_1)$ and $Y \sim \text{Binomial}(n, p_2)$. To test the null
hypothesis that $p_1 = p_2$ write

$$H_0 : \delta = 0 \quad \text{versus} \quad H_1 : \delta \ne 0$$

where $\delta = p_1 - p_2$. The mle is $\hat\delta = \hat{p}_1 - \hat{p}_2$ with estimated standard error

$$\widehat{\text{se}} = \sqrt{ \frac{\hat{p}_1(1 - \hat{p}_1)}{m} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n} }.$$

The size $\alpha$ Wald test is to reject $H_0$ when $|W| > z_{\alpha/2}$ where

$$W = \frac{\hat\delta - 0}{\widehat{\text{se}}} = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{ \frac{\hat{p}_1(1 - \hat{p}_1)}{m} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n} }}.$$

The power of this test will be largest when $p_1$ is far from $p_2$ and when the
sample sizes are large.

What if we used the same test set to test both algorithms? The two samples
are no longer independent. Instead we use the following strategy. Let $X_i = 1$
if algorithm 1 is correct on test case $i$ and $X_i = 0$ otherwise. Let $Y_i = 1$ if
algorithm 2 is correct on test case $i$, and $Y_i = 0$ otherwise. Define $D_i = X_i - Y_i$.
A typical dataset will look something like this:

Test Case   $X_i$   $Y_i$   $D_i = X_i - Y_i$
1           1       0        1
2           1       1        0
3           1       1        0
4           0       1       -1
5           0       0        0
...
n           0       1       -1

Let

$$\delta = \mathbb{E}(D_i) = \mathbb{E}(X_i) - \mathbb{E}(Y_i) = \mathbb{P}(X_i = 1) - \mathbb{P}(Y_i = 1).$$

The nonparametric plug-in estimate of $\delta$ is $\hat\delta = \bar{D} = n^{-1}\sum_{i=1}^n D_i$ and $\widehat{\text{se}}(\hat\delta) =
S/\sqrt{n}$, where $S^2 = n^{-1}\sum_{i=1}^n (D_i - \bar{D})^2$. To test $H_0 : \delta = 0$ versus $H_1 : \delta \ne 0$
we use $W = \hat\delta/\widehat{\text{se}}$ and reject $H_0$ if $|W| > z_{\alpha/2}$. This is called a paired
comparison. ■
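
Both versions of the test in this example can be sketched in Python. The counts and the 0/1 vectors below are made-up illustrative data, and the function names are assumptions for the sketch. Each function returns the Wald statistic together with the decision at level $\alpha$.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def wald_two_proportions(x, m, y, n, alpha=0.05):
    """Independent test sets: x errors out of m, y errors out of n."""
    p1, p2 = x / m, y / n
    se = sqrt(p1 * (1 - p1) / m + p2 * (1 - p2) / n)
    w = (p1 - p2) / se
    return w, abs(w) > Z.inv_cdf(1 - alpha / 2)


def wald_paired(xs, ys, alpha=0.05):
    """Shared test set: xs[i], ys[i] are 1 if the algorithm is correct on case i."""
    d = [xi - yi for xi, yi in zip(xs, ys)]
    n = len(d)
    d_bar = sum(d) / n
    s2 = sum((di - d_bar) ** 2 for di in d) / n   # S^2 with divisor n, as in the text
    se = sqrt(s2 / n)                             # undefined (zero) if all D_i are equal
    w = d_bar / se
    return w, abs(w) > Z.inv_cdf(1 - alpha / 2)


# Illustrative data (not from the text).
w_indep, reject_indep = wald_two_proportions(x=30, m=200, y=50, n=200)
w_pair, reject_pair = wald_paired([1, 1, 1, 0, 0, 0], [0, 1, 1, 1, 0, 1])
```

The paired version uses only the differences $D_i$, so it remains valid when the two algorithms' outcomes on a shared test case are correlated. With only six test cases, as in the toy call above, the Normal approximation is rough and the call is meant only to show the mechanics.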



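10.8 Example (Comparing Two Means). Let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ be
two independent samples from populations with means $\mu_1$ and $\mu_2$,
respectively. Let’s test the null hypothesis that $\mu_1 = \mu_2$. Write this as $H_0 : \delta = 0$
versus $H_1 : \delta \ne 0$ where $\delta = \mu_1 - \mu_2$. Recall that the nonparametric plug-in
estimate of $\delta$ is $\hat\delta = \bar{X} - \bar{Y}$ with estimated standard error

$$\widehat{\text{se}} = \sqrt{ \frac{s_1^2}{m} + \frac{s_2^2}{n} }$$

where $s_1^2$ and $s_2^2$ are the sample variances. The size $\alpha$ Wald test rejects $H_0$
when $|W| > z_{\alpha/2}$ where

$$W = \frac{\hat\delta - 0}{\widehat{\text{se}}} = \frac{\bar{X} - \bar{Y}}{\sqrt{ \frac{s_1^2}{m} + \frac{s_2^2}{n} }}. \qquad ■$$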




10.9 Example (Comparing Two Medians). Consider the previous example again
but let us test whether the medians of the two distributions are the same.
Thus, $H_0 : \delta = 0$ versus $H_1 : \delta \ne 0$ where $\delta = \nu_1 - \nu_2$ and $\nu_1$ and $\nu_2$ are
the medians. The nonparametric plug-in estimate of $\delta$ is $\hat\delta = \hat\nu_1 - \hat\nu_2$ where $\hat\nu_1$
and $\hat\nu_2$ are the sample medians. The estimated standard error $\widehat{\text{se}}$ of $\hat\delta$ can be
obtained from the bootstrap. The Wald test statistic is $W = \hat\delta/\widehat{\text{se}}$. ■
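
A minimal sketch of this procedure follows. Each sample is resampled with replacement, the difference of medians is recomputed, and the standard deviation of the replications serves as $\widehat{\text{se}}$. The simulated data, the number of bootstrap replications and the function names are assumptions made for illustration.

```python
import random
from statistics import median, pstdev


def bootstrap_se_median_diff(x, y, n_boot=2000, seed=0):
    """Bootstrap standard error of median(x) - median(y)."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_boot):
        xb = rng.choices(x, k=len(x))   # resample x with replacement
        yb = rng.choices(y, k=len(y))   # resample y with replacement
        reps.append(median(xb) - median(yb))
    return pstdev(reps)


def wald_median_statistic(x, y, n_boot=2000, seed=0):
    """W = delta_hat / se_hat for the difference of medians."""
    delta_hat = median(x) - median(y)
    return delta_hat / bootstrap_se_median_diff(x, y, n_boot, seed)


# Simulated data (illustrative): the true medians differ by 3.
rng = random.Random(1)
x = [rng.gauss(3.0, 1.0) for _ in range(50)]
y = [rng.gauss(0.0, 1.0) for _ in range(50)]
w = wald_median_statistic(x, y)
```

The two samples are resampled separately because they are independent. The null hypothesis is rejected at level $\alpha$ when $|W| > z_{\alpha/2}$, exactly as in the previous examples.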



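There is a relationship between the Wald test and the $1 - \alpha$ asymptotic
confidence interval $\hat\theta \pm \widehat{\text{se}}\, z_{\alpha/2}$.

10.10 Theorem. The size $\alpha$ Wald test rejects $H_0 : \theta = \theta_0$ versus $H_1 : \theta \ne \theta_0$
if and only if $\theta_0 \notin C$ where

$$C = \left( \hat\theta - \widehat{\text{se}}\, z_{\alpha/2},\ \hat\theta + \widehat{\text{se}}\, z_{\alpha/2} \right).$$

Thus, testing the hypothesis is equivalent to checking whether the null value
is in the confidence interval.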



Warning! When we reject $H_0$ we often say that the result is statistically
significant. A result might be statistically significant and yet the size of the
effect might be small. In such a case we have a result that is statistically
significant but not scientifically or practically significant. The difference between
statistical significance and scientific significance is easy to understand in light
of Theorem 10.10. Any confidence interval that excludes $\theta_0$ corresponds to
rejecting $H_0$. But the values in the interval could be close to $\theta_0$ (not scientifically
significant) or far from $\theta_0$ (scientifically significant). See Figure 10.2.

FIGURE 10.2. Scientific significance versus statistical significance. A level $\alpha$ test
rejects $H_0 : \theta = \theta_0$ if and only if the $1 - \alpha$ confidence interval does not include
$\theta_0$. Here are two different confidence intervals. Both exclude $\theta_0$ so in both cases the
test would reject $H_0$. But in the first case, the estimated value of $\theta$ is close to $\theta_0$ so
the finding is probably of little scientific or practical value. In the second case, the
estimated value of $\theta$ is far from $\theta_0$ so the finding is of scientific value. This shows
two things. First, statistical significance does not imply that a finding is of scientific
importance. Second, confidence intervals are often more informative than tests.



10.2 p-values

Reporting “reject $H_0$” or “retain $H_0$” is not very informative. Instead, we
could ask, for every $\alpha$, whether the test rejects at that level. Generally, if the
test rejects at level $\alpha$ it will also reject at level $\alpha' > \alpha$. Hence, there is a
smallest $\alpha$ at which the test rejects and we call this number the p-value. See
Figure 10.3.

10.11 Definition. Suppose that for every $\alpha \in (0, 1)$ we have a size $\alpha$ test
with rejection region $R_\alpha$. Then,

$$\text{p-value} = \inf\{ \alpha : T(X^n) \in R_\alpha \}.$$

That is, the p-value is the smallest level at which we can reject $H_0$.

Informally, the p-value is a measure of the evidence against $H_0$: the smaller
the p-value, the stronger the evidence against $H_0$. Typically, researchers use
the following evidence scale:

p-value      evidence
< .01        very strong evidence against $H_0$
.01 - .05    strong evidence against $H_0$
.05 - .10    weak evidence against $H_0$
> .1         little or no evidence against $H_0$

FIGURE 10.3. p-values explained. For each $\alpha$ we can ask: does our test reject $H_0$
at level $\alpha$? The p-value is the smallest $\alpha$ at which we do reject $H_0$. If the evidence
against $H_0$ is strong, the p-value will be small.


Warning! A large p-value is not strong evidence in favor of $H_0$. A large
p-value can occur for two reasons: (i) $H_0$ is true or (ii) $H_0$ is false but the test
has low power.

Warning! Do not confuse the p-value with $\mathbb{P}(H_0|\text{Data})$.² The p-value is
not the probability that the null hypothesis is true.

The following result explains how to compute the p-value.

² We discuss quantities like $\mathbb{P}(H_0|\text{Data})$ in the chapter on Bayesian inference.
10.12 Theorem. Suppose that the size $\alpha$ test is of the form

reject $H_0$ if and only if $T(X^n) \ge c_\alpha$.

Then,

$$\text{p-value} = \sup_{\theta \in \Theta_0} \mathbb{P}_\theta\left( T(X^n) \ge T(x^n) \right)$$

where $x^n$ is the observed value of $X^n$. If $\Theta_0 = \{\theta_0\}$ then

$$\text{p-value} = \mathbb{P}_{\theta_0}\left( T(X^n) \ge T(x^n) \right).$$

We can express Theorem 10.12 as follows:

The p-value is the probability (under $H_0$) of observing a value of the
test statistic the same as or more extreme than what was actually
observed.



10.13 Theorem. Let $w = (\hat\theta - \theta_0)/\widehat{\text{se}}$ denote the observed value of the
Wald statistic $W$. The p-value is given by

$$\text{p-value} = \mathbb{P}_{\theta_0}(|W| > |w|) \approx \mathbb{P}(|Z| > |w|) = 2\Phi(-|w|) \qquad (10.7)$$

where $Z \sim N(0, 1)$.

To understand this last theorem, look at Figure 10.4.

Here is an important property of p-values. 

10.14 Theorem. If the test statistic has a continuous distribution, then under
$H_0 : \theta = \theta_0$, the p-value has a Uniform(0,1) distribution. Therefore, if we
reject $H_0$ when the p-value is less than $\alpha$, the probability of a type I error is
$\alpha$.

In other words, if $H_0$ is true, the p-value is like a random draw from a
Unif(0, 1) distribution. If $H_1$ is true, the distribution of the p-value will tend
to concentrate closer to 0.



FIGURE 10.4. The p-value is the smallest $\alpha$ at which you would reject $H_0$. To
find the p-value for the Wald test, we find $\alpha$ such that $|w|$ and $-|w|$ are just at the
boundary of the rejection region. Here, $w$ is the observed value of the Wald statistic:
$w = (\hat\theta - \theta_0)/\widehat{\text{se}}$. This implies that the p-value is the tail area $\mathbb{P}(|Z| > |w|)$ where
$Z \sim N(0, 1)$.

10.15 Example. Recall the cholesterol data from Example 7.15. To test if the
means are different we compute

$$W = \frac{\hat\delta - 0}{\widehat{\text{se}}} = \frac{\bar{X} - \bar{Y}}{\sqrt{ \frac{s_1^2}{m} + \frac{s_2^2}{n} }} = \frac{216.2 - 195.3}{\sqrt{5^2 + 2.4^2}} = 3.78.$$

To compute the p-value, let $Z \sim N(0, 1)$ denote a standard Normal random
variable. Then,

$$\text{p-value} = \mathbb{P}(|Z| > 3.78) = 2\,\mathbb{P}(Z < -3.78) = .0002$$

which is very strong evidence against the null hypothesis. To test if the
medians are different, let $\hat\nu_1$ and $\hat\nu_2$ denote the sample medians. Then,

$$W = \frac{\hat\nu_1 - \hat\nu_2}{\widehat{\text{se}}} = \frac{212.5 - 194}{7.7} = 2.4$$

where the standard error 7.7 was found using the bootstrap. The p-value is

$$\text{p-value} = \mathbb{P}(|Z| > 2.4) = 2\,\mathbb{P}(Z < -2.4) = .02$$

which is strong evidence against the null hypothesis. ■
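
The two p-values in this example follow directly from formula (10.7). As a small sketch (the function name is an assumption), the standard library's Normal distribution is enough to reproduce them from the reported Wald statistics:

```python
from statistics import NormalDist


def wald_p_value(w):
    """Two-sided p-value 2 * Phi(-|w|) from formula (10.7)."""
    return 2.0 * NormalDist().cdf(-abs(w))


p_means = wald_p_value(3.78)    # Wald statistic for the difference of means
p_medians = wald_p_value(2.4)   # Wald statistic for the difference of medians
```

Rounded to the precision used in the example, these agree with the reported values .0002 and .02.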



10.3 The $\chi^2$ Distribution

Before proceeding we need to discuss the $\chi^2$ distribution. Let $Z_1, \ldots, Z_k$ be
independent, standard Normals. Let $V = \sum_{i=1}^k Z_i^2$. Then we say that $V$ has
a $\chi^2$ distribution with $k$ degrees of freedom, written $V \sim \chi^2_k$. The probability
density of $V$ is

$$f(v) = \frac{v^{(k/2)-1} e^{-v/2}}{2^{k/2}\,\Gamma(k/2)}$$

for $v > 0$. It can be shown that $\mathbb{E}(V) = k$ and $\mathbb{V}(V) = 2k$. We define the upper
$\alpha$ quantile $\chi^2_{k,\alpha} = F^{-1}(1 - \alpha)$ where $F$ is the CDF. That is, $\mathbb{P}(\chi^2_k > \chi^2_{k,\alpha}) = \alpha$.

FIGURE 10.5. The p-value is the smallest $\alpha$ at which we would reject $H_0$. To find
the p-value for the $\chi^2_{k-1}$ test, we find $\alpha$ such that the observed value $t$ of the test
statistic is just at the boundary of the rejection region. This implies that the p-value
is the tail area $\mathbb{P}(\chi^2_{k-1} > t)$.

10.4 Pearson’s $\chi^2$ Test For Multinomial Data

Pearson’s $\chi^2$ test is used for multinomial data. Recall that if $X = (X_1, \ldots, X_k)$
has a multinomial $(n, p)$ distribution, then the MLE of $p$ is $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_k) =
(X_1/n, \ldots, X_k/n)$.

Let $p_0 = (p_{01}, \ldots, p_{0k})$ be some fixed vector and suppose we want to test

$$H_0 : p = p_0 \quad \text{versus} \quad H_1 : p \ne p_0.$$

10.16 Definition. Pearson’s $\chi^2$ statistic is

$$T = \sum_{j=1}^k \frac{(X_j - n p_{0j})^2}{n p_{0j}} = \sum_{j=1}^k \frac{(X_j - E_j)^2}{E_j}$$

where $E_j = \mathbb{E}(X_j) = n p_{0j}$ is the expected value of $X_j$ under $H_0$.

10.17 Theorem. Under $H_0$, $T \rightsquigarrow \chi^2_{k-1}$. Hence the test: reject $H_0$ if $T >
\chi^2_{k-1,\alpha}$ has asymptotic level $\alpha$. The p-value is $\mathbb{P}(\chi^2_{k-1} > t)$ where $t$ is the
observed value of the test statistic.

Theorem 10.17 is illustrated in Figure 10.5.
10.18 Example (Mendel’s peas). Mendel bred peas with round yellow seeds
and wrinkled green seeds. There are four types of progeny: round yellow,
wrinkled yellow, round green, and wrinkled green. The number of each type
is multinomial with probability $p = (p_1, p_2, p_3, p_4)$. His theory of inheritance
predicts that $p$ is equal to

$$p_0 \equiv \left( \frac{9}{16}, \frac{3}{16}, \frac{3}{16}, \frac{1}{16} \right).$$

In $n = 556$ trials he observed $X = (315, 101, 108, 32)$. We will test $H_0 : p = p_0$
versus $H_1 : p \ne p_0$. Since $n p_{01} = 312.75$, $n p_{02} = n p_{03} = 104.25$, and $n p_{04} =
34.75$, the test statistic is

$$\chi^2 = \frac{(315 - 312.75)^2}{312.75} + \frac{(101 - 104.25)^2}{104.25} + \frac{(108 - 104.25)^2}{104.25} + \frac{(32 - 34.75)^2}{34.75} = 0.47.$$

The $\alpha = .05$ value for a $\chi^2_3$ is 7.815. Since 0.47 is not larger than 7.815 we do
not reject the null. The p-value is

$$\text{p-value} = \mathbb{P}(\chi^2_3 > .47) = .93$$

which is not evidence against $H_0$. Hence, the data do not contradict Mendel’s
theory.³ ■
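
The arithmetic in this example can be checked with a short script. The Python standard library has no $\chi^2$ distribution, so the sketch uses a closed-form tail probability that holds only for 3 degrees of freedom, $\mathbb{P}(\chi^2_3 > t) = \operatorname{erfc}(\sqrt{t/2}) + \sqrt{2t/\pi}\,e^{-t/2}$. The function names are assumptions for illustration.

```python
from math import erfc, exp, pi, sqrt


def pearson_statistic(observed, p0):
    """Pearson's chi-square statistic: sum of (X_j - E_j)^2 / E_j."""
    n = sum(observed)
    expected = [n * p for p in p0]
    return sum((x - e) ** 2 / e for x, e in zip(observed, expected))


def chi2_sf_3df(t):
    """P(chi^2_3 > t). This closed form is valid only for 3 degrees of freedom."""
    return erfc(sqrt(t / 2.0)) + sqrt(2.0 * t / pi) * exp(-t / 2.0)


observed = [315, 101, 108, 32]            # Mendel's counts from the example
p0 = [9 / 16, 3 / 16, 3 / 16, 1 / 16]     # probabilities under H0
t = pearson_statistic(observed, p0)
p_value = chi2_sf_3df(t)
```

For other degrees of freedom a general $\chi^2$ tail function from a statistics library would be needed in place of `chi2_sf_3df`.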

In the previous example, one could argue that hypothesis testing is not the
right tool. Hypothesis testing is useful to see if there is evidence to reject $H_0$.
This is appropriate when $H_0$ corresponds to the status quo. It is not useful for
proving that $H_0$ is true. Failure to reject $H_0$ might occur because $H_0$ is true,
but it might occur just because the test has low power. Perhaps a confidence
set for the distance between $p$ and $p_0$ might be more useful in this example.



10.5 The Permutation Test 

The permutation test is a nonparametric method for testing whether two
distributions are the same. This test is “exact,” meaning that it is not based
on large sample theory approximations. Suppose that $X_1, \ldots, X_m \sim F_X$ and
$Y_1, \ldots, Y_n \sim F_Y$ are two independent samples and $H_0$ is the hypothesis that
the two samples are identically distributed. This is the type of hypothesis we
would consider when testing whether a treatment differs from a placebo. More
precisely we are testing

$$H_0 : F_X = F_Y \quad \text{versus} \quad H_1 : F_X \ne F_Y.$$

³ There is some controversy about whether Mendel’s results are “too good.”

Let $T(x_1, \ldots, x_m, y_1, \ldots, y_n)$ be some test statistic, for example,

$$T(X_1, \ldots, X_m, Y_1, \ldots, Y_n) = |\bar{X}_m - \bar{Y}_n|.$$

Let $N = m + n$ and consider forming all $N!$ permutations of the data $X_1, \ldots,
X_m, Y_1, \ldots, Y_n$. For each permutation, compute the test statistic $T$. Denote
these values by $T_1, \ldots, T_{N!}$. Under the null hypothesis, each of these values is
equally likely.⁴ The distribution $\mathbb{P}_0$ that puts mass $1/N!$ on each $T_j$ is called
the permutation distribution of $T$. Let $t_{\text{obs}}$ be the observed value of the
test statistic. Assuming we reject when $T$ is large, the p-value is

$$\text{p-value} = \mathbb{P}_0(T > t_{\text{obs}}) = \frac{1}{N!}\sum_{j=1}^{N!} I(T_j > t_{\text{obs}}).$$

10.19 Example. Here is a toy example to make the idea clear. Suppose the
data are: $(X_1, X_2, Y_1) = (1, 9, 3)$. Let $T(X_1, X_2, Y_1) = |\bar{X} - \bar{Y}| = 2$. The
permutations are:

permutation   value of T   probability
(1,9,3)       2            1/6
(9,1,3)       2            1/6
(1,3,9)       7            1/6
(3,1,9)       7            1/6
(3,9,1)       5            1/6
(9,3,1)       5            1/6

The p-value is $\mathbb{P}(T > 2) = 4/6$. ■

Usually, it is not practical to evaluate all $N!$ permutations. We can
approximate the p-value by sampling randomly from the set of permutations. The
fraction of times $T_j > t_{\text{obs}}$ among these samples approximates the p-value.



⁴ More precisely, under the null hypothesis, given the ordered data values,
$X_1, \ldots, X_m, Y_1, \ldots, Y_n$ is uniformly distributed over the $N!$ permutations of the data.

Algorithm for Permutation Test

1. Compute the observed value of the test statistic
$t_{\text{obs}} = T(X_1, \ldots, X_m, Y_1, \ldots, Y_n)$.

2. Randomly permute the data. Compute the statistic again using the
permuted data.

3. Repeat the previous step $B$ times and let $T_1, \ldots, T_B$ denote the
resulting values.

4. The approximate p-value is

$$\frac{1}{B}\sum_{j=1}^{B} I(T_j > t_{\text{obs}}).$$
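
The four steps above, together with the exact enumeration used in Example 10.19, can be sketched as follows. The function names, the default number of random permutations `B`, and the fixed seed are assumptions made for the illustration. The strict inequality $T_j > t_{\text{obs}}$ follows the formula in the text.

```python
import random
from itertools import permutations


def abs_mean_diff(x, y):
    """Test statistic T = |mean(x) - mean(y)|."""
    return abs(sum(x) / len(x) - sum(y) / len(y))


def exact_permutation_p_value(x, y, stat=abs_mean_diff):
    """Enumerate all N! orderings; feasible only for very small N."""
    m = len(x)
    t_obs = stat(x, y)
    values = [stat(p[:m], p[m:]) for p in permutations(list(x) + list(y))]
    return sum(t > t_obs for t in values) / len(values)


def approx_permutation_p_value(x, y, stat=abs_mean_diff, B=10000, seed=0):
    """Steps 1-4 of the algorithm, using B random permutations."""
    rng = random.Random(seed)
    m = len(x)
    pooled = list(x) + list(y)
    t_obs = stat(x, y)                      # step 1
    count = 0
    for _ in range(B):                      # steps 2 and 3
        rng.shuffle(pooled)
        if stat(pooled[:m], pooled[m:]) > t_obs:
            count += 1
    return count / B                        # step 4


# Toy data from Example 10.19: (X1, X2, Y1) = (1, 9, 3).
p_exact = exact_permutation_p_value([1, 9], [3])
p_approx = approx_permutation_p_value([1, 9], [3])
```

Some references count ties with $\ge$ instead of $>$, which gives a more conservative p-value. Swapping the comparison operator is the only change needed to switch conventions.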



10.20 Example. DNA microarrays allow researchers to measure the
expression levels of thousands of genes. The data are the levels of messenger RNA
(mRNA) of each gene, which is thought to provide a measure of how much
protein that gene produces. Roughly, the larger the number, the more active
the gene. The table below, reproduced from Efron et al. (2001), shows the
expression levels for genes from ten patients with two types of liver cancer
cells. There are 2,638 genes in this experiment but here we show just the first
two. The data are log-ratios of the intensity levels of two different color dyes
used on the arrays.

           Type I                                    Type II
Patient    1      2        3        4      5         6      7       8       9      10
Gene 1     230    -1,350   -1,580   -400   -760      970    110     -50     -190   -200
Gene 2     470    -850     -.8      -280   120       390    -1730   -1360   -1     -330

Let’s test whether the median level of gene 1 is different between the two
groups. Let $\nu_1$ denote the median level of gene 1 of Type I and let $\nu_2$ denote the
median level of gene 1 of Type II. The absolute difference of sample medians
is $T = |\hat\nu_1 - \hat\nu_2| = 710$. Now we estimate the permutation distribution by
simulation and we find that the estimated p-value is .045. Thus, if we use a
$\alpha = .05$ level of significance, we would say that there is evidence to reject the
null hypothesis of no difference. ■
In large samples, the permutation test usually gives similar results to a test
that is based on large sample theory. The permutation test is thus most useful 
for small samples. 

10.6 The Likelihood Ratio Test 

The Wald test is useful for testing a scalar parameter. The likelihood ratio
test is more general and can be used for testing a vector-valued parameter.

10.21 Definition. Consider testing

$$H_0 : \theta \in \Theta_0 \quad \text{versus} \quad H_1 : \theta \notin \Theta_0.$$

The likelihood ratio statistic is

$$\lambda = 2 \log\left( \frac{\sup_{\theta \in \Theta} \mathcal{L}(\theta)}{\sup_{\theta \in \Theta_0} \mathcal{L}(\theta)} \right) = 2 \log\left( \frac{\mathcal{L}(\hat\theta)}{\mathcal{L}(\hat\theta_0)} \right)$$

where $\hat\theta$ is the mle and $\hat\theta_0$ is the mle when $\theta$ is restricted to lie in $\Theta_0$.

You might have expected to see the maximum of the likelihood over $\Theta_0^c$
instead of $\Theta$ in the numerator. In practice, replacing $\Theta_0^c$ with $\Theta$ has little
effect on the test statistic. Moreover, the theoretical properties of $\lambda$ are much
simpler if the test statistic is defined this way.

The likelihood ratio test is most useful when $\Theta_0$ consists of all parameter
values $\theta$ such that some coordinates of $\theta$ are fixed at particular values.

10.22 Theorem. Suppose that $\theta = (\theta_1, \ldots, \theta_q, \theta_{q+1}, \ldots, \theta_r)$. Let

$$\Theta_0 = \{ \theta : (\theta_{q+1}, \ldots, \theta_r) = (\theta_{0,q+1}, \ldots, \theta_{0,r}) \}.$$

Let $\lambda$ be the likelihood ratio test statistic. Under $H_0 : \theta \in \Theta_0$,

$$\lambda(x^n) \rightsquigarrow \chi^2_{r-q}$$

where $r - q$ is the dimension of $\Theta$ minus the dimension of $\Theta_0$. The p-value
for the test is $\mathbb{P}(\chi^2_{r-q} > \lambda)$.

For example, if $\theta = (\theta_1, \theta_2, \theta_3, \theta_4, \theta_5)$ and we want to test the null hypothesis
that $\theta_4 = \theta_5 = 0$ then the limiting distribution has $5 - 3 = 2$ degrees of
freedom.







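10.23 Example (Mendel’s Peas Revisited). Consider example 10.18 again. The
likelihood ratio test statistic for $H_0 : p = p_0$ versus $H_1 : p \ne p_0$ is

$$\lambda = 2 \log\left( \frac{\mathcal{L}(\hat{p})}{\mathcal{L}(p_0)} \right) = 2 \sum_{j=1}^4 X_j \log\left( \frac{\hat{p}_j}{p_{0j}} \right)$$

$$= 2\left( 315 \log\left( \frac{315/556}{9/16} \right) + 101 \log\left( \frac{101/556}{3/16} \right) + 108 \log\left( \frac{108/556}{3/16} \right) + 32 \log\left( \frac{32/556}{1/16} \right) \right) = 0.48.$$

Under $H_1$ there are four parameters. However, the parameters must sum to
one so the dimension of the parameter space is three. Under $H_0$ there are no
free parameters so the dimension of the restricted parameter space is zero. The
difference of these two dimensions is three. Therefore, the limiting distribution
of $\lambda$ under $H_0$ is $\chi^2_3$ and the p-value is

$$\text{p-value} = \mathbb{P}(\chi^2_3 > .48) = .92.$$

The conclusion is the same as with the $\chi^2$ test. ■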

When the likelihood ratio test and the $\chi^2$ test are both applicable, as in the
last example, they usually lead to similar results as long as the sample size is
large.
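
The likelihood ratio statistic for Mendel's data can be computed the same way. For multinomial data, $\lambda = 2\sum_j X_j \log(\hat{p}_j/p_{0j})$ with $\hat{p}_j = X_j/n$. The sketch below uses that formula together with a closed-form $\chi^2_3$ tail probability that holds only for 3 degrees of freedom. The function names are assumptions for illustration.

```python
from math import erfc, exp, log, pi, sqrt


def lrt_multinomial(observed, p0):
    """lambda = 2 * sum_j X_j * log(p_hat_j / p0_j), with p_hat_j = X_j / n."""
    n = sum(observed)
    # Cells with a zero count contribute nothing to the sum.
    return 2.0 * sum(x * log((x / n) / p) for x, p in zip(observed, p0) if x > 0)


def chi2_sf_3df(t):
    """P(chi^2_3 > t). This closed form is valid only for 3 degrees of freedom."""
    return erfc(sqrt(t / 2.0)) + sqrt(2.0 * t / pi) * exp(-t / 2.0)


observed = [315, 101, 108, 32]            # Mendel's counts
p0 = [9 / 16, 3 / 16, 3 / 16, 1 / 16]     # probabilities under H0
lam = lrt_multinomial(observed, p0)
p_value = chi2_sf_3df(lam)
```

The statistic comes out close to Pearson's 0.47 from Example 10.18, in line with the remark that the two tests usually agree when the sample size is large.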



10.7 Multiple Testing 

In some situations we may conduct many hypothesis tests. In example 10.20, 
there were actually 2,638 genes. If we tested for a difference for each gene, 
we would be conducting 2,638 separate hypothesis tests. Suppose each test 
is conducted at level a. For any one test, the chance of a false rejection of 
the null is a. But the chance of at least one false rejection is much higher. 
This is the multiple testing problem. The problem comes up in many data 
mining situations where one may end up testing thousands or even millions of 
hypotheses. There are many ways to deal with this problem. Here we discuss 
two methods. 
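To see how quickly the chance of at least one false rejection grows, here is a small sketch. It assumes, purely for illustration, that all $m$ null hypotheses are true and that the tests are independent, in which case the chance is $1 - (1 - \alpha)^m$. The function name is hypothetical.

```python
def prob_any_false_rejection(alpha, m):
    """P(at least one false rejection) for m independent level-alpha tests,
    assuming every null hypothesis is true (an illustrative assumption)."""
    return 1.0 - (1.0 - alpha) ** m

for m in (1, 10, 100, 2638):
    print(m, round(prob_any_false_rejection(0.05, m), 4))
```

Under these assumptions the probability is already about 0.40 with only ten tests, and it is essentially 1 for 2,638 tests.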
Consider $m$ hypothesis tests:

$$H_{0i} \quad \text{versus} \quad H_{1i}, \qquad i = 1, \ldots, m$$

and let $P_1, \ldots, P_m$ denote the $m$ p-values for these tests.

The Bonferroni Method

Given p-values $P_1, \ldots, P_m$, reject null hypothesis $H_{0i}$ if

$$P_i < \frac{\alpha}{m}.$$



10.24 Theorem. Using the Bonferroni method, the probability of falsely rejecting
any null hypotheses is less than or equal to $\alpha$.

Proof. Let $R$ be the event that at least one null hypothesis is falsely
rejected. Let $R_i$ be the event that the $i$th null hypothesis is falsely rejected.
Recall that if $A_1, \ldots, A_k$ are events then $\mathbb{P}(\bigcup_{i=1}^k A_i) \le \sum_{i=1}^k \mathbb{P}(A_i)$. Hence,

$$\mathbb{P}(R) = \mathbb{P}\left(\bigcup_{i=1}^m R_i\right) \le \sum_{i=1}^m \mathbb{P}(R_i) \le \sum_{i=1}^m \frac{\alpha}{m} = \alpha$$

from Theorem 10.14. ■



10.25 Example. In the gene example, using $\alpha = .05$, we have that
$.05/2{,}638 = .00001895375$. Hence, for any gene with p-value less than
.00001895375, we declare that there is a significant difference. ■



The Bonferroni method is very conservative because it is trying to make
it unlikely that you would make even one false rejection. Sometimes, a more
reasonable idea is to control the false discovery rate (FDR) which is defined
as the mean of the number of false rejections divided by the number of
rejections.

Suppose we reject all null hypotheses whose p-values fall below some threshold.
Let $m_0$ be the number of null hypotheses that are true and let $m_1 = m - m_0$.
The tests can be categorized in a $2 \times 2$ table as in Table 10.2.

Define the False Discovery Proportion (FDP)

$$\text{FDP} = \begin{cases} V/R & \text{if } R > 0 \\ 0 & \text{if } R = 0. \end{cases}$$

The FDP is the proportion of rejections that are incorrect. Next define
$\text{FDR} = \mathbb{E}(\text{FDP})$.

            H0 Not Rejected   H0 Rejected   Total
H0 True            U               V          m0
H0 False           T               S          m1
Total            m - R             R          m

TABLE 10.2. Types of outcomes in multiple testing.



The Benjamini-Hochberg (BH) Method 

1. Let $P_{(1)} < \cdots < P_{(m)}$ denote the ordered p-values.

2. Define

$$\ell_i = \frac{i\alpha}{C_m m}, \quad \text{and} \quad R = \max\{ i : P_{(i)} < \ell_i \} \qquad (10.8)$$

where $C_m$ is defined to be 1 if the p-values are independent and
$C_m = \sum_{i=1}^m (1/i)$ otherwise.

3. Let $T = P_{(R)}$; we call $T$ the BH rejection threshold.

4. Reject all null hypotheses $H_{0i}$ for which $P_i \le T$.



10.26 Theorem (Benjamini and Hochberg). If the procedure above is applied, 
then regardless of how many nulls are true and regardless of the distribution 
of the p-values when the null hypothesis is false, 

$$\text{FDR} = \mathbb{E}(\text{FDP}) \le \frac{m_0}{m}\alpha \le \alpha.$$

10.27 Example. Figure 10.6 shows six ordered p-values plotted as vertical
lines. If we tested at level $\alpha$ without doing any correction for multiple testing,
we would reject all hypotheses whose p-values are less than $\alpha$. In this case,
the four hypotheses corresponding to the four smallest p-values are rejected.
The Bonferroni method rejects all hypotheses whose p-values are less than
$\alpha/m$. In this case, this leads to no rejections. The BH threshold corresponds
to the last p-value that falls under the line with slope $\alpha$. This leads to two
hypotheses being rejected in this case. ■

FIGURE 10.6. The Benjamini-Hochberg (BH) procedure. For uncorrected testing
we reject when $P_i < \alpha$. For Bonferroni testing we reject when $P_i < \alpha/m$. The BH
procedure rejects when $P_i \le T$. The BH threshold $T$ corresponds to the rightmost
undercrossing of the upward sloping line.

10.28 Example. Suppose that 10 independent hypothesis tests are carried out,
leading to the following ordered p-values:

0.00017 0.00448 0.00671 0.00907 0.01220
0.33626 0.39341 0.53882 0.58125 0.98617

With $\alpha = 0.05$, the Bonferroni test rejects any hypothesis whose p-value is
less than $\alpha/10 = 0.005$. Thus, only the first two hypotheses are rejected. For
the BH test, we find the largest $i$ such that $P_{(i)} < i\alpha/m$, which in this case is
$i = 5$. Thus we reject the first five hypotheses. ■
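The two procedures are short enough to sketch in code. The following is a minimal illustration, not a library implementation; the function names and the `independent` flag (which switches $C_m$ between 1 and $\sum_{i=1}^m 1/i$) are choices made here for the example. It is applied to the ten p-values of Example 10.28.

```python
def bonferroni(pvalues, alpha):
    """Reject H_0i when P_i < alpha / m. Returns a list of booleans."""
    m = len(pvalues)
    return [p < alpha / m for p in pvalues]

def benjamini_hochberg(pvalues, alpha, independent=True):
    """BH procedure: find R = max{i : P_(i) < i*alpha/(C_m*m)}, then reject
    every hypothesis whose p-value is at most T = P_(R)."""
    m = len(pvalues)
    c_m = 1.0 if independent else sum(1.0 / i for i in range(1, m + 1))
    ordered = sorted(pvalues)
    r = 0
    for i, p in enumerate(ordered, start=1):
        if p < i * alpha / (c_m * m):
            r = i                      # keep the largest index that qualifies
    if r == 0:
        return [False] * m             # no p-value falls under the line
    threshold = ordered[r - 1]
    return [p <= threshold for p in pvalues]

pvalues = [0.00017, 0.00448, 0.00671, 0.00907, 0.01220,
           0.33626, 0.39341, 0.53882, 0.58125, 0.98617]

print(sum(bonferroni(pvalues, 0.05)))          # number of Bonferroni rejections
print(sum(benjamini_hochberg(pvalues, 0.05)))  # number of BH rejections
```

With these inputs the counts match the example: two rejections for Bonferroni and five for BH. Passing `independent=False` uses the more conservative $C_m$ and rejects fewer hypotheses.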



10.8 Goodness-of-fit Tests 

There is another situation where testing arises, namely, when we want to check 
whether the data come from an assumed parametric model. There are many 
such tests; here is one. 

Let $\mathfrak{F} = \{f(x; \theta) : \theta \in \Theta\}$ be a parametric model. Suppose the data take
values on the real line. Divide the line into $k$ disjoint intervals $I_1, \ldots, I_k$. For
$j = 1, \ldots, k$, let

$$p_j(\theta) = \int_{I_j} f(x; \theta)\, dx$$

be the probability that an observation falls into interval $I_j$ under the assumed
model. Here, $\theta = (\theta_1, \ldots, \theta_s)$ are the parameters in the assumed model. Let
$N_j$ be the number of observations that fall into $I_j$. The likelihood for $\theta$ based
on the counts $N_1, \ldots, N_k$ is the multinomial likelihood

$$Q(\theta) = \prod_{j=1}^k p_j(\theta)^{N_j}.$$

Maximizing $Q(\theta)$ yields estimates $\tilde{\theta} = (\tilde{\theta}_1, \ldots, \tilde{\theta}_s)$ of $\theta$. Now define the test
statistic

$$Q = \sum_{j=1}^k \frac{(N_j - n p_j(\tilde{\theta}))^2}{n p_j(\tilde{\theta})}. \qquad (10.9)$$

10.29 Theorem. Let $H_0$ be the null hypothesis that the data are IID draws from
the model $\mathfrak{F} = \{f(x; \theta) : \theta \in \Theta\}$. Under $H_0$, the statistic $Q$ defined in
equation (10.9) converges in distribution to a $\chi^2_{k-1-s}$ random variable. Thus,
the (approximate) p-value for the test is $\mathbb{P}(\chi^2_{k-1-s} > q)$ where $q$ denotes the
observed value of $Q$.



It is tempting to replace $\tilde{\theta}$ in (10.9) with the mle $\hat{\theta}$. However, this will not
result in a statistic whose limiting distribution is a $\chi^2_{k-1-s}$. It can be
shown, due to a theorem of Herman Chernoff and Erich Lehmann from 1954,
that the p-value is bounded approximately by the p-values obtained
using a $\chi^2_{k-1-s}$ and a $\chi^2_{k-1}$.

Goodness-of-fit testing has some serious limitations. If we reject $H_0$ then we
conclude we should not use the model. But if we do not reject $H_0$ we cannot
conclude that the model is correct. We may have failed to reject simply
because the test did not have enough power. This is why it is better to use
nonparametric methods whenever possible rather than relying on parametric
assumptions.



10.9 Bibliographic Remarks 

The most complete book on testing is Lehmann (1986). See also Chapter 8 of 
Casella and Berger (2002) and Chapter 9 of Rice (1995). The FDR method is 
due to Benjamini and Hochberg (1995). Some of the exercises are from Rice 
(1995). 







10.10 Appendix 

10.10.1 The Neyman-Pearson Lemma

In the special case of a simple null $H_0 : \theta = \theta_0$ and a simple alternative
$H_1 : \theta = \theta_1$ we can say precisely what the most powerful test is.

10.30 Theorem (Neyman-Pearson). Suppose we test $H_0 : \theta = \theta_0$ versus
$H_1 : \theta = \theta_1$. Let

$$T = \frac{\mathcal{L}(\theta_1)}{\mathcal{L}(\theta_0)} = \frac{\prod_{i=1}^n f(x_i; \theta_1)}{\prod_{i=1}^n f(x_i; \theta_0)}.$$

Suppose we reject $H_0$ when $T > k$. If we choose $k$ so that $\mathbb{P}_{\theta_0}(T > k) = \alpha$
then this test is the most powerful, size $\alpha$ test. That is, among all tests with
size $\alpha$, this test maximizes the power $\beta(\theta_1)$.



10.10.2 The t-test 

To test $H_0 : \mu = \mu_0$ where $\mu = \mathbb{E}(X_i)$ is the mean, we can use the Wald test.
When the data are assumed to be Normal and the sample size is small, it is
common instead to use the t-test. A random variable $T$ has a t-distribution
with $k$ degrees of freedom if it has density

$$f(t) = \frac{\Gamma\left(\frac{k+1}{2}\right)}{\sqrt{k\pi}\, \Gamma\left(\frac{k}{2}\right) \left(1 + \frac{t^2}{k}\right)^{(k+1)/2}}.$$

When the degrees of freedom $k \to \infty$, this tends to a Normal distribution.
When $k = 1$ it reduces to a Cauchy.

Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ where $\theta = (\mu, \sigma^2)$ are both unknown. Suppose
we want to test $\mu = \mu_0$ versus $\mu \neq \mu_0$. Let

$$T = \frac{\sqrt{n}(\overline{X}_n - \mu_0)}{S_n}$$

where $S_n^2$ is the sample variance. For large samples $T \approx N(0, 1)$ under $H_0$.
The exact distribution of $T$ under $H_0$ is $t_{n-1}$. Hence if we reject when
$|T| > t_{n-1,\alpha/2}$ then we get a size $\alpha$ test. However, when $n$ is moderately large, the
t-test is essentially identical to the Wald test.



10.11 Exercises 



1. Prove Theorem 10.6.

2. Prove Theorem 10.14.

3. Prove Theorem 10.10.

4. Prove Theorem 10.12.

5. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$ and let $Y = \max\{X_1, \ldots, X_n\}$. We want
to test

$$H_0 : \theta = 1/2 \quad \text{versus} \quad H_1 : \theta > 1/2.$$

The Wald test is not appropriate since $Y$ does not converge to a Normal.
Suppose we decide to test this hypothesis by rejecting $H_0$ when $Y > c$.

(a) Find the power function.

(b) What choice of $c$ will make the size of the test .05?

(c) In a sample of size $n = 20$ with $Y = 0.48$ what is the p-value? What
conclusion about $H_0$ would you make?

(d) In a sample of size $n = 20$ with $Y = 0.52$ what is the p-value? What
conclusion about $H_0$ would you make?

6. There is a theory that people can postpone their death until after an
important event. To test the theory, Phillips and King (1988) collected
data on deaths around the Jewish holiday Passover. Of 1919 deaths, 922
died the week before the holiday and 997 died the week after. Think of
this as a binomial and test the null hypothesis that $\theta = 1/2$. Report and
interpret the p-value. Also construct a confidence interval for $\theta$.

7. In 1861, 10 essays appeared in the New Orleans Daily Crescent. They
were signed “Quintus Curtius Snodgrass” and some people suspected
they were actually written by Mark Twain. To investigate this, we will
consider the proportion of three letter words found in an author’s work.
From eight Twain essays we have:

.225 .262 .217 .240 .230 .229 .235 .217

From 10 Snodgrass essays we have:

.209 .205 .196 .210 .202 .207 .224 .223 .220 .201

(a) Perform a Wald test for equality of the means. Use the nonparametric
plug-in estimator. Report the p-value and a 95 percent confidence
interval for the difference of means. What do you conclude?

(b) Now use a permutation test to avoid the use of large sample methods.
What is your conclusion? (Brinegar (1963)).

8. Let $X_1, \ldots, X_n \sim N(\theta, 1)$. Consider testing

$$H_0 : \theta = 0 \quad \text{versus} \quad \theta = 1.$$

Let the rejection region be $R = \{x^n : T(x^n) > c\}$ where $T(x^n) = n^{-1} \sum_{i=1}^n X_i$.

(a) Find $c$ so that the test has size $\alpha$.

(b) Find the power under $H_1$, that is, find $\beta(1)$.

(c) Show that $\beta(1) \to 1$ as $n \to \infty$.

9. Let $\hat{\theta}$ be the mle of a parameter $\theta$ and let $\widehat{\text{se}} = \{n I(\hat{\theta})\}^{-1/2}$ where $I(\theta)$
is the Fisher information. Consider testing

$$H_0 : \theta = \theta_0 \quad \text{versus} \quad \theta \neq \theta_0.$$

Consider the Wald test with rejection region $R = \{x^n : |Z| > z_{\alpha/2}\}$
where $Z = (\hat{\theta} - \theta_0)/\widehat{\text{se}}$. Let $\theta_1 > \theta_0$ be some alternative. Show that
$\beta(\theta_1) \to 1$.

10. Here are the number of elderly Jewish and Chinese women who died
just before and after the Chinese Harvest Moon Festival.

Week   Chinese   Jewish
 -2       55       141
 -1       33       145
  1       70       139
  2       49       161

Compare the two mortality patterns. (Phillips and Smith (1990)).

11. A randomized, double-blind experiment was conducted to assess the
effectiveness of several drugs for reducing postoperative nausea. The
data are as follows.

                          Number of Patients   Incidence of Nausea
Placebo                           80                   45
Chlorpromazine                    75                   26
Dimenhydrinate                    85                   52
Pentobarbital (100 mg)            67                   35
Pentobarbital (150 mg)            85                   37

(a) Test each drug versus the placebo at the 5 percent level. Also, report
the estimated odds-ratios. Summarize your findings.

(b) Use the Bonferroni and the FDR method to adjust for multiple
testing. (Beecher (1959)).

12. Let $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$.

(a) Let $\lambda_0 > 0$. Find the size $\alpha$ Wald test for

$$H_0 : \lambda = \lambda_0 \quad \text{versus} \quad H_1 : \lambda \neq \lambda_0.$$

(b) (Computer Experiment.) Let $\lambda_0 = 1$, $n = 20$ and $\alpha = .05$. Simulate
$X_1, \ldots, X_n \sim \text{Poisson}(\lambda_0)$ and perform the Wald test. Repeat many
times and count how often you reject the null. How close is the type I
error rate to .05?

13. Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$. Construct the likelihood ratio test for

$$H_0 : \mu = \mu_0 \quad \text{versus} \quad H_1 : \mu \neq \mu_0.$$

Compare to the Wald test.

14. Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$. Construct the likelihood ratio test for

$$H_0 : \sigma = \sigma_0 \quad \text{versus} \quad H_1 : \sigma \neq \sigma_0.$$

Compare to the Wald test.

15. Let $X \sim \text{Binomial}(n, p)$. Construct the likelihood ratio test for

$$H_0 : p = p_0 \quad \text{versus} \quad H_1 : p \neq p_0.$$

Compare to the Wald test.

16. Let $\theta$ be a scalar parameter and suppose we test

$$H_0 : \theta = \theta_0 \quad \text{versus} \quad H_1 : \theta \neq \theta_0.$$

Let $W$ be the Wald test statistic and let $\lambda$ be the likelihood ratio test
statistic. Show that these tests are equivalent in the sense that

$$\frac{W^2}{\lambda} \xrightarrow{P} 1$$

as $n \to \infty$. Hint: Use a Taylor expansion of the log-likelihood $\ell(\theta)$ to
show that

$$\lambda \approx \left(\sqrt{n}(\hat{\theta} - \theta_0)\right)^2 \left(-\frac{1}{n}\ell''(\hat{\theta})\right).$$
11.1 The Bayesian Philosophy

The statistical methods that we have discussed so far are known as frequentist
(or classical) methods. The frequentist point of view is based on the
following postulates:

F1 Probability refers to limiting relative frequencies. Probabilities are
objective properties of the real world.

F2 Parameters are fixed, unknown constants. Because they are not fluctuating,
no useful probability statements can be made about parameters.

F3 Statistical procedures should be designed to have well-defined long run
frequency properties. For example, a 95 percent confidence interval should
trap the true value of the parameter with limiting frequency at least 95
percent.



There is another approach to inference called Bayesian inference. The
Bayesian approach is based on the following postulates:

B1 Probability describes degree of belief, not limiting frequency. As such,
we can make probability statements about lots of things, not just data
which are subject to random variation. For example, I might say that
“the probability that Albert Einstein drank a cup of tea on August 1,
1948” is .35. This does not refer to any limiting frequency. It reflects my
strength of belief that the proposition is true.

B2 We can make probability statements about parameters, even though
they are fixed constants.

B3 We make inferences about a parameter $\theta$ by producing a probability
distribution for $\theta$. Inferences, such as point estimates and interval
estimates, may then be extracted from this distribution.

Bayesian inference is a controversial approach because it inherently embraces
a subjective notion of probability. In general, Bayesian methods provide
no guarantees on long run performance. The field of statistics puts more
emphasis on frequentist methods although Bayesian methods certainly have
a presence. Certain data mining and machine learning communities seem to
embrace Bayesian methods very strongly. Let’s put aside philosophical
arguments for now and see how Bayesian inference is done. We’ll conclude this
chapter with some discussion on the strengths and weaknesses of the Bayesian
approach.



11.2 The Bayesian Method 

Bayesian inference is usually carried out in the following way. 

1. We choose a probability density $f(\theta)$, called the prior distribution,
that expresses our beliefs about a parameter $\theta$ before we see any
data.

2. We choose a statistical model $f(x \mid \theta)$ that reflects our beliefs about $x$
given $\theta$. Notice that we now write this as $f(x \mid \theta)$ instead of $f(x; \theta)$.

3. After observing data $X_1, \ldots, X_n$, we update our beliefs and calculate
the posterior distribution $f(\theta \mid X_1, \ldots, X_n)$.



To see how the third step is carried out, first suppose that $\theta$ is discrete and
that there is a single, discrete observation $X$. We should use a capital letter
now to denote the parameter since we are treating it like a random variable,
so let $\Theta$ denote the parameter. Now, in this discrete setting,

$$\mathbb{P}(\Theta = \theta \mid X = x) = \frac{\mathbb{P}(X = x, \Theta = \theta)}{\mathbb{P}(X = x)} = \frac{\mathbb{P}(X = x \mid \Theta = \theta)\, \mathbb{P}(\Theta = \theta)}{\sum_{\theta} \mathbb{P}(X = x \mid \Theta = \theta)\, \mathbb{P}(\Theta = \theta)}$$

which you may recognize from Chapter 1 as Bayes’ theorem. The version
for continuous variables is obtained by using density functions:

$$f(\theta \mid x) = \frac{f(x \mid \theta) f(\theta)}{\int f(x \mid \theta) f(\theta)\, d\theta}. \qquad (11.1)$$



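If we have $n$ IID observations $X_1, \ldots, X_n$, we replace $f(x \mid \theta)$ with

$$f(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^n f(x_i \mid \theta) = \mathcal{L}_n(\theta).$$

Notation. We will write $X^n$ to mean $(X_1, \ldots, X_n)$ and $x^n$ to mean $(x_1, \ldots, x_n)$.
Now,

$$f(\theta \mid x^n) = \frac{f(x^n \mid \theta) f(\theta)}{\int f(x^n \mid \theta) f(\theta)\, d\theta} = \frac{\mathcal{L}_n(\theta) f(\theta)}{c_n} \propto \mathcal{L}_n(\theta) f(\theta) \qquad (11.2)$$

where

$$c_n = \int \mathcal{L}_n(\theta) f(\theta)\, d\theta \qquad (11.3)$$

is called the normalizing constant. Note that $c_n$ does not depend on $\theta$. We
can summarize by writing:

Posterior is proportional to Likelihood times Prior

or, in symbols,

$$f(\theta \mid x^n) \propto \mathcal{L}(\theta) f(\theta).$$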



You might wonder, doesn’t it cause a problem to throw away the constant
$c_n$? The answer is that we can always recover the constant later if we need to.

What do we do with the posterior distribution? First, we can get a point
estimate by summarizing the center of the posterior. Typically, we use the
mean or mode of the posterior. The posterior mean is

$$\overline{\theta}_n = \int \theta f(\theta \mid x^n)\, d\theta = \frac{\int \theta \mathcal{L}_n(\theta) f(\theta)\, d\theta}{\int \mathcal{L}_n(\theta) f(\theta)\, d\theta}. \qquad (11.4)$$

We can also obtain a Bayesian interval estimate. We find $a$ and $b$ such that
$\int_{-\infty}^{a} f(\theta \mid x^n)\, d\theta = \int_{b}^{\infty} f(\theta \mid x^n)\, d\theta = \alpha/2$. Let $C = (a, b)$. Then

$$\mathbb{P}(\theta \in C \mid x^n) = \int_a^b f(\theta \mid x^n)\, d\theta = 1 - \alpha$$

so $C$ is a $1 - \alpha$ posterior interval.



11.1 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. Suppose we take the uniform
distribution $f(p) = 1$ as a prior. By Bayes’ theorem, the posterior has the
form

$$f(p \mid x^n) \propto f(p) \mathcal{L}_n(p) = p^s (1 - p)^{n-s} = p^{s+1-1} (1 - p)^{n-s+1-1}$$

where $s = \sum_{i=1}^n x_i$ is the number of successes. Recall that a random variable
has a Beta distribution with parameters $\alpha$ and $\beta$ if its density is

$$f(p; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} p^{\alpha - 1} (1 - p)^{\beta - 1}.$$

We see that the posterior for $p$ is a Beta distribution with parameters $s + 1$
and $n - s + 1$. That is,

$$f(p \mid x^n) = \frac{\Gamma(n + 2)}{\Gamma(s + 1)\Gamma(n - s + 1)} p^{(s+1)-1} (1 - p)^{(n-s+1)-1}.$$

We write this as

$$p \mid x^n \sim \text{Beta}(s + 1, n - s + 1).$$

Notice that we have figured out the normalizing constant without actually
doing the integral $\int \mathcal{L}_n(p) f(p)\, dp$. The mean of a $\text{Beta}(\alpha, \beta)$ distribution is
$\alpha/(\alpha + \beta)$ so the Bayes estimator is

$$\overline{p} = \frac{s + 1}{n + 2}. \qquad (11.5)$$

It is instructive to rewrite the estimator as

$$\overline{p} = \lambda_n \hat{p} + (1 - \lambda_n) \tilde{p} \qquad (11.6)$$

where $\hat{p} = s/n$ is the mle, $\tilde{p} = 1/2$ is the prior mean and $\lambda_n = n/(n + 2) \approx 1$.
A 95 percent posterior interval can be obtained by numerically finding $a$ and
$b$ such that $\int_a^b f(p \mid x^n)\, dp = .95$.

Suppose that instead of a uniform prior, we use the prior $p \sim \text{Beta}(\alpha, \beta)$.
If you repeat the calculations above, you will see that
$p \mid x^n \sim \text{Beta}(\alpha + s, \beta + n - s)$. The flat prior is just the special case with
$\alpha = \beta = 1$. The posterior mean is

$$\overline{p} = \frac{\alpha + s}{\alpha + \beta + n} = \left(\frac{n}{\alpha + \beta + n}\right) \hat{p} + \left(\frac{\alpha + \beta}{\alpha + \beta + n}\right) p_0$$

where $p_0 = \alpha/(\alpha + \beta)$ is the prior mean. ■
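The Beta posterior in this example is easy to explore numerically. The sketch below uses hypothetical data (s = 7 successes in n = 10 trials) and hypothetical function names. It checks that the posterior mean equals the weighted average of the mle and the prior mean, and it approximates an equal-tailed 95 percent posterior interval by sampling with the standard library's `random.betavariate` rather than by numerical integration.

```python
import random

def beta_posterior(s, n, a=1.0, b=1.0):
    """Parameters of the posterior Beta(a + s, b + n - s) under a Beta(a, b) prior."""
    return a + s, b + n - s

def posterior_mean(s, n, a=1.0, b=1.0):
    post_a, post_b = beta_posterior(s, n, a, b)
    return post_a / (post_a + post_b)

def weighted_form(s, n, a=1.0, b=1.0):
    """Same mean written as weight * mle + (1 - weight) * prior mean."""
    weight = n / (a + b + n)
    return weight * (s / n) + (1 - weight) * (a / (a + b))

def posterior_interval(s, n, a=1.0, b=1.0, level=0.95, draws=20000, seed=0):
    """Equal-tailed interval approximated from sorted posterior draws."""
    rng = random.Random(seed)
    post_a, post_b = beta_posterior(s, n, a, b)
    sample = sorted(rng.betavariate(post_a, post_b) for _ in range(draws))
    lo = sample[int((1 - level) / 2 * draws)]
    hi = sample[int((1 + level) / 2 * draws) - 1]
    return lo, hi

s, n = 7, 10   # hypothetical data
print(posterior_mean(s, n), weighted_form(s, n))
print(posterior_interval(s, n))
```

With the flat prior the posterior mean is (s + 1)/(n + 2) = 8/12, slightly closer to 1/2 than the mle 0.7.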



In the previous example, the prior was a Beta distribution and the posterior 
was a Beta distribution. When the prior and the posterior are in the same 
family, we say that the prior is conjugate with respect to the model. 

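11.2 Example. Let $X_1, \ldots, X_n \sim N(\theta, \sigma^2)$. For simplicity, let us assume that
$\sigma$ is known. Suppose we take as a prior $\theta \sim N(a, b^2)$. In problem 1 in the
exercises it is shown that the posterior for $\theta$ is

$$\theta \mid X^n \sim N(\overline{\theta}, \tau^2) \qquad (11.7)$$

where

$$\overline{\theta} = w \overline{X} + (1 - w) a, \qquad w = \frac{\frac{1}{\text{se}^2}}{\frac{1}{\text{se}^2} + \frac{1}{b^2}}, \qquad \frac{1}{\tau^2} = \frac{1}{\text{se}^2} + \frac{1}{b^2},$$

and $\text{se} = \sigma/\sqrt{n}$ is the standard error of the mle $\overline{X}$. This is another example
of a conjugate prior. Note that $w \to 1$ and $\tau/\text{se} \to 1$ as $n \to \infty$. So, for large
$n$, the posterior is approximately $N(\hat{\theta}, \text{se}^2)$. The same is true if $n$ is fixed but
$b \to \infty$, which corresponds to letting the prior become very flat.

Continuing with this example, let us find $C = (c, d)$ such that
$\mathbb{P}(\theta \in C \mid X^n) = .95$. We can do this by choosing $c$ and $d$ such that
$\mathbb{P}(\theta < c \mid X^n) = .025$ and $\mathbb{P}(\theta > d \mid X^n) = .025$. So, we want to find $c$ such that

$$\mathbb{P}(\theta < c \mid X^n) = \mathbb{P}\left(\frac{\theta - \overline{\theta}}{\tau} < \frac{c - \overline{\theta}}{\tau} \,\middle|\, X^n\right) = \mathbb{P}\left(Z < \frac{c - \overline{\theta}}{\tau}\right) = .025.$$

We know that $\mathbb{P}(Z < -1.96) = .025$. So,

$$\frac{c - \overline{\theta}}{\tau} = -1.96$$

implying that $c = \overline{\theta} - 1.96\tau$. By similar arguments, $d = \overline{\theta} + 1.96\tau$. So a 95 percent
Bayesian interval is $\overline{\theta} \pm 1.96\tau$. Since $\overline{\theta} \approx \hat{\theta}$ and $\tau \approx \text{se}$, the 95 percent Bayesian
interval is approximated by $\hat{\theta} \pm 1.96\,\text{se}$ which is the frequentist confidence
interval. ■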
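The posterior formulas of Example 11.2 can be evaluated directly. This is a minimal sketch with hypothetical inputs (sample mean 1.2, n = 25, known σ = 2, prior N(0, 1)); the function name is an assumption made for illustration. It also shows the behavior noted in the example: as the prior standard deviation b grows, the weight w approaches 1 and τ approaches se.

```python
import math

def normal_posterior(xbar, n, sigma, a, b):
    """Posterior N(theta_bar, tau^2) for a Normal mean with known sigma
    and prior theta ~ N(a, b^2). Returns (theta_bar, tau, w)."""
    se2 = sigma ** 2 / n                       # squared standard error of xbar
    precision = 1.0 / se2 + 1.0 / b ** 2       # 1 / tau^2
    w = (1.0 / se2) / precision
    theta_bar = w * xbar + (1.0 - w) * a
    tau = math.sqrt(1.0 / precision)
    return theta_bar, tau, w

theta_bar, tau, w = normal_posterior(xbar=1.2, n=25, sigma=2.0, a=0.0, b=1.0)
interval = (theta_bar - 1.96 * tau, theta_bar + 1.96 * tau)
print(theta_bar, tau, w, interval)

# A very flat prior (large b) nearly reproduces the frequentist interval.
flat_mean, flat_tau, flat_w = normal_posterior(xbar=1.2, n=25, sigma=2.0, a=0.0, b=1e6)
print(flat_mean, flat_tau, flat_w)
```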
11.3 Functions of Parameters



How do we make inferences about a function $\tau = g(\theta)$? Remember in Chapter
3 we solved the following problem: given the density $f_X$ for $X$, find the density
for $Y = g(X)$. We now simply apply the same reasoning. The posterior CDF
for $\tau$ is

$$H(\tau \mid x^n) = \mathbb{P}(g(\theta) \le \tau \mid x^n) = \int_A f(\theta \mid x^n)\, d\theta$$

where $A = \{\theta : g(\theta) \le \tau\}$. The posterior density is $h(\tau \mid x^n) = H'(\tau \mid x^n)$.

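11.3 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ and $f(p) = 1$ so that
$p \mid X^n \sim \text{Beta}(s + 1, n - s + 1)$ with $s = \sum_{i=1}^n x_i$. Let $\psi = \log(p/(1 - p))$. Then

$$\begin{aligned}
H(\psi \mid x^n) &= \mathbb{P}(\Psi \le \psi \mid x^n) = \mathbb{P}\left(\log\left(\frac{P}{1 - P}\right) \le \psi \,\middle|\, x^n\right) \\
&= \mathbb{P}\left(P \le \frac{e^{\psi}}{1 + e^{\psi}} \,\middle|\, x^n\right) = \int_0^{e^{\psi}/(1 + e^{\psi})} f(p \mid x^n)\, dp \\
&= \frac{\Gamma(n + 2)}{\Gamma(s + 1)\Gamma(n - s + 1)} \int_0^{e^{\psi}/(1 + e^{\psi})} p^s (1 - p)^{n - s}\, dp
\end{aligned}$$

and

$$\begin{aligned}
h(\psi \mid x^n) &= H'(\psi \mid x^n) \\
&= \frac{\Gamma(n + 2)}{\Gamma(s + 1)\Gamma(n - s + 1)} \left(\frac{e^{\psi}}{1 + e^{\psi}}\right)^{s} \left(\frac{1}{1 + e^{\psi}}\right)^{n - s} \frac{\partial}{\partial \psi}\left(\frac{e^{\psi}}{1 + e^{\psi}}\right) \\
&= \frac{\Gamma(n + 2)}{\Gamma(s + 1)\Gamma(n - s + 1)} \left(\frac{e^{\psi}}{1 + e^{\psi}}\right)^{s} \left(\frac{1}{1 + e^{\psi}}\right)^{n - s} \frac{e^{\psi}}{(1 + e^{\psi})^2} \\
&= \frac{\Gamma(n + 2)}{\Gamma(s + 1)\Gamma(n - s + 1)} \left(\frac{e^{\psi}}{1 + e^{\psi}}\right)^{s + 1} \left(\frac{1}{1 + e^{\psi}}\right)^{n - s + 1}
\end{aligned}$$

for $\psi \in \mathbb{R}$.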



11.4 Simulation 

The posterior can often be approximated by simulation. Suppose we draw
$\theta_1, \ldots, \theta_B \sim p(\theta \mid x^n)$. Then a histogram of $\theta_1, \ldots, \theta_B$ approximates the
posterior density $p(\theta \mid x^n)$. An approximation to the posterior mean
$\overline{\theta}_n = \mathbb{E}(\theta \mid x^n)$ is $B^{-1} \sum_{j=1}^B \theta_j$. The posterior $1 - \alpha$ interval can be
approximated by $(\theta_{\alpha/2}, \theta_{1 - \alpha/2})$ where $\theta_{\alpha/2}$ is the $\alpha/2$ sample quantile of
$\theta_1, \ldots, \theta_B$.

Once we have a sample $\theta_1, \ldots, \theta_B$ from $f(\theta \mid x^n)$, let $\tau_i = g(\theta_i)$. Then
$\tau_1, \ldots, \tau_B$ is a sample from $f(\tau \mid x^n)$. This avoids the need to do any analytical
calculations. Simulation is discussed in more detail in Chapter 24.

11.4 Example. Consider again Example 11.3. We can approximate the posterior
for $\psi$ without doing any calculus. Here are the steps:

1. Draw $P_1, \ldots, P_B \sim \text{Beta}(s + 1, n - s + 1)$.

2. Let $\psi_i = \log(P_i/(1 - P_i))$ for $i = 1, \ldots, B$.

Now $\psi_1, \ldots, \psi_B$ are IID draws from $h(\psi \mid x^n)$. A histogram of these values
provides an estimate of $h(\psi \mid x^n)$. ■
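The two steps of Example 11.4 translate almost line for line into code. The sketch below uses hypothetical data (s = 12 successes in n = 20 trials) and a hypothetical function name. It draws from the Beta posterior with `random.betavariate`, transforms each draw to the log-odds scale, and summarizes the draws by their mean and an equal-tailed 95 percent interval instead of plotting a histogram.

```python
import math
import random

def simulate_log_odds(s, n, draws=10000, seed=0):
    """Draw P_1..P_B ~ Beta(s+1, n-s+1) and return psi_i = log(P_i / (1 - P_i))."""
    rng = random.Random(seed)
    ps = [rng.betavariate(s + 1, n - s + 1) for _ in range(draws)]
    return [math.log(p / (1.0 - p)) for p in ps]

psi = sorted(simulate_log_odds(s=12, n=20))   # hypothetical data

post_mean = sum(psi) / len(psi)
interval = (psi[int(0.025 * len(psi))], psi[int(0.975 * len(psi)) - 1])
print(post_mean, interval)
```

The sorted draws could equally be passed to any plotting routine to obtain the histogram estimate of the posterior density.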



11.5 Large Sample Properties of Bayes’ Procedures 

In the Bernoulli and Normal examples we saw that the posterior mean was 
close to the mle. This is true in greater generality. 

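11.5 Theorem. Let $\hat{\theta}_n$ be the mle and let $\widehat{\text{se}} = 1/\sqrt{n I(\hat{\theta}_n)}$. Under appropriate
regularity conditions, the posterior is approximately Normal with mean $\hat{\theta}_n$ and
standard deviation $\widehat{\text{se}}$. Hence, $\overline{\theta}_n \approx \hat{\theta}_n$. Also, if
$C_n = (\hat{\theta}_n - z_{\alpha/2}\, \widehat{\text{se}},\ \hat{\theta}_n + z_{\alpha/2}\, \widehat{\text{se}})$
is the asymptotic frequentist $1 - \alpha$ confidence interval, then $C_n$ is also an
approximate $1 - \alpha$ Bayesian posterior interval:

$$\mathbb{P}(\theta \in C_n \mid X^n) \to 1 - \alpha.$$

There is also a Bayesian delta method. Let $\tau = g(\theta)$. Then

$$\tau \mid X^n \approx N(\hat{\tau}, \widetilde{\text{se}}^2)$$

where $\hat{\tau} = g(\hat{\theta})$ and $\widetilde{\text{se}} = \widehat{\text{se}}\, |g'(\hat{\theta})|$.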



11.6 Flat Priors, Improper Priors, and 
“Noninformative” Priors 

An important question in Bayesian inference is: where does one get the prior
$f(\theta)$? One school of thought, called subjectivism, says that the prior should
reflect our subjective opinion about $\theta$ (before the data are collected). This may
be possible in some cases but is impractical in complicated problems, especially
if there are many parameters. Moreover, injecting subjective opinion into the
analysis is contrary to the goal of making scientific inference as objective
as possible. An alternative is to try to define some sort of “noninformative
prior.” An obvious candidate for a noninformative prior is to use a flat prior
$f(\theta) \propto$ constant.

In the Bernoulli example, taking $f(p) = 1$ leads to $p \mid X^n \sim \text{Beta}(s + 1, n - s + 1)$
as we saw earlier, which seemed very reasonable. But unfettered use of
flat priors raises some questions.

Improper Priors. Let $X_1, \ldots, X_n \sim N(\theta, \sigma^2)$ with $\sigma$ known. Suppose we adopt
a flat prior $f(\theta) \propto c$ where $c > 0$ is a constant. Note that $\int f(\theta)\, d\theta = \infty$ so
this is not a probability density in the usual sense. We call such a prior an
improper prior. Nonetheless, we can still formally carry out Bayes’ theorem
and compute the posterior density by multiplying the prior and the likelihood:
$f(\theta \mid x^n) \propto \mathcal{L}_n(\theta) f(\theta) \propto \mathcal{L}_n(\theta)$. This gives $\theta \mid X^n \sim N(\overline{X}, \sigma^2/n)$ and the resulting
point and interval estimators agree exactly with their frequentist counterparts.
In general, improper priors are not a problem as long as the resulting posterior
is a well-defined probability distribution.

Flat Priors are Not Invariant. Let $X \sim \text{Bernoulli}(p)$ and suppose we
use the flat prior $f(p) = 1$. This flat prior presumably represents our lack of
information about $p$ before the experiment. Now let $\psi = \log(p/(1 - p))$. This
is a transformation of $p$ and we can compute the resulting distribution for $\psi$,
namely,

$$f_{\Psi}(\psi) = \frac{e^{\psi}}{(1 + e^{\psi})^2}$$

which is not flat. But if we are ignorant about $p$ then we are also ignorant
about $\psi$ so we should use a flat prior for $\psi$. This is a contradiction. In short,
the notion of a flat prior is not well defined because a flat prior on a parameter
does not imply a flat prior on a transformed version of the parameter. Flat
priors are not transformation invariant.
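The density of $\psi$ stated in this paragraph follows from the change-of-variables formula. As a brief sketch, assuming only that $p$ is uniform on $(0, 1)$ and $\psi = \log(p/(1 - p))$:

```latex
\begin{align*}
p &= \frac{e^{\psi}}{1 + e^{\psi}}, \qquad
\frac{dp}{d\psi} = \frac{e^{\psi}}{(1 + e^{\psi})^{2}}, \\
f_{\Psi}(\psi) &= f\bigl(p(\psi)\bigr) \left| \frac{dp}{d\psi} \right|
  = 1 \cdot \frac{e^{\psi}}{(1 + e^{\psi})^{2}} .
\end{align*}
```

At $\psi = 0$ this density equals $1/4$, and it decays toward 0 as $|\psi|$ grows, so a flat prior on $p$ is far from flat on the log-odds scale.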

Jeffreys’ Prior. Jeffreys came up with a rule for creating priors. The
rule is: take

$$f(\theta) \propto I(\theta)^{1/2}$$

where $I(\theta)$ is the Fisher information function. This rule turns out to be
transformation invariant. There are various reasons for thinking that this prior
might be a useful prior but we will not go into details here.

11.6 Example. Consider the $\text{Bernoulli}(p)$ model. Recall that

$$I(p) = \frac{1}{p(1 - p)}.$$

Jeffreys’ rule says to use the prior

$$f(p) \propto \sqrt{I(p)} = p^{-1/2} (1 - p)^{-1/2}.$$

This is a $\text{Beta}(1/2, 1/2)$ density. This is very close to a uniform density. ■

In a multiparameter problem, the Jeffreys’ prior is defined to be
$f(\theta) \propto \sqrt{|I(\theta)|}$ where $|A|$ denotes the determinant of a matrix $A$ and $I(\theta)$ is the
Fisher information matrix.



11.7 Multiparameter Problems 

Suppose that $\theta = (\theta_1, \ldots, \theta_p)$. The posterior density is still given by

$$f(\theta \mid x^n) \propto \mathcal{L}_n(\theta) f(\theta). \qquad (11.8)$$

The question now arises of how to extract inferences about one parameter.
The key is to find the marginal posterior density for the parameter of interest.
Suppose we want to make inferences about $\theta_1$. The marginal posterior for $\theta_1$
is

$$f(\theta_1 \mid x^n) = \int \cdots \int f(\theta_1, \ldots, \theta_p \mid x^n)\, d\theta_2 \cdots d\theta_p. \qquad (11.9)$$

In practice, it might not be feasible to do this integral. Simulation can help.
Draw randomly from the posterior:

$$\theta^1, \ldots, \theta^B \sim f(\theta \mid x^n)$$

where the superscripts index the different draws. Each $\theta^j$ is a vector
$\theta^j = (\theta_1^j, \ldots, \theta_p^j)$. Now collect together the first component of each draw:

$$\theta_1^1, \ldots, \theta_1^B.$$

These are a sample from $f(\theta_1 \mid x^n)$ and we have avoided doing any integrals.

11.7 Example (Comparing Two Binomials). Suppose we have $n_1$ control patients
and $n_2$ treatment patients and that $X_1$ control patients survive while
$X_2$ treatment patients survive. We want to estimate $\tau = g(p_1, p_2) = p_2 - p_1$.
Then,

$$X_1 \sim \text{Binomial}(n_1, p_1) \quad \text{and} \quad X_2 \sim \text{Binomial}(n_2, p_2).$$

If $f(p_1, p_2) = 1$, the posterior is

$$f(p_1, p_2 \mid x_1, x_2) \propto p_1^{x_1} (1 - p_1)^{n_1 - x_1} p_2^{x_2} (1 - p_2)^{n_2 - x_2}.$$

Notice that $(p_1, p_2)$ live on a rectangle (a square, actually) and that

$$f(p_1, p_2 \mid x_1, x_2) = f(p_1 \mid x_1) f(p_2 \mid x_2)$$

where

$$f(p_1 \mid x_1) \propto p_1^{x_1} (1 - p_1)^{n_1 - x_1} \quad \text{and} \quad f(p_2 \mid x_2) \propto p_2^{x_2} (1 - p_2)^{n_2 - x_2}$$

which implies that $p_1$ and $p_2$ are independent under the posterior. Also,
$p_1 \mid x_1 \sim \text{Beta}(x_1 + 1, n_1 - x_1 + 1)$ and $p_2 \mid x_2 \sim \text{Beta}(x_2 + 1, n_2 - x_2 + 1)$.
If we simulate $P_{1,1}, \ldots, P_{1,B} \sim \text{Beta}(x_1 + 1, n_1 - x_1 + 1)$ and
$P_{2,1}, \ldots, P_{2,B} \sim \text{Beta}(x_2 + 1, n_2 - x_2 + 1)$, then $\tau_b = P_{2,b} - P_{1,b}$, $b = 1, \ldots, B$, is a sample from
$f(\tau \mid x_1, x_2)$. ■
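The simulation described at the end of Example 11.7 can be sketched as follows. The data (30 of 50 control patients and 40 of 50 treatment patients surviving) and the function name are hypothetical, chosen only to make the block runnable. Because the two posteriors are independent Beta distributions, each draw of τ is simply the difference of two independent Beta draws.

```python
import random

def simulate_difference(x1, n1, x2, n2, draws=10000, seed=0):
    """Posterior draws of tau = p2 - p1 under the flat prior f(p1, p2) = 1."""
    rng = random.Random(seed)
    taus = []
    for _ in range(draws):
        p1 = rng.betavariate(x1 + 1, n1 - x1 + 1)
        p2 = rng.betavariate(x2 + 1, n2 - x2 + 1)
        taus.append(p2 - p1)
    return taus

taus = sorted(simulate_difference(x1=30, n1=50, x2=40, n2=50))  # hypothetical data

tau_mean = sum(taus) / len(taus)
tau_interval = (taus[int(0.025 * len(taus))], taus[int(0.975 * len(taus)) - 1])
prob_positive = sum(t > 0 for t in taus) / len(taus)   # posterior P(p2 > p1)
print(tau_mean, tau_interval, prob_positive)
```

The same sorted draws give the posterior mean of τ, an equal-tailed 95 percent interval, and the posterior probability that the treatment survival rate exceeds the control rate, all without any integration.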



11.8 Bayesian Testing 



Hypothesis testing from a Bayesian point of view is a complex topic. We
will only give a brief sketch of the main idea here. The Bayesian approach
to testing involves putting a prior on $H_0$ and on the parameter $\theta$ and then
computing $\mathbb{P}(H_0 \mid X^n = x^n)$. Consider the case where $\theta$ is scalar and we are testing

$$H_0 : \theta = \theta_0 \quad \text{versus} \quad H_1 : \theta \neq \theta_0.$$

It is usually reasonable to use the prior $\mathbb{P}(H_0) = \mathbb{P}(H_1) = 1/2$ (although this
is not essential in what follows). Under $H_1$ we need a prior for $\theta$. Denote this
prior density by $f(\theta)$. From Bayes’ theorem

$$\begin{aligned}
\mathbb{P}(H_0 \mid X^n = x^n) &= \frac{f(x^n \mid H_0)\, \mathbb{P}(H_0)}{f(x^n \mid H_0)\, \mathbb{P}(H_0) + f(x^n \mid H_1)\, \mathbb{P}(H_1)} \\
&= \frac{\frac{1}{2} f(x^n \mid \theta_0)}{\frac{1}{2} f(x^n \mid \theta_0) + \frac{1}{2} f(x^n \mid H_1)} \\
&= \frac{f(x^n \mid \theta_0)}{f(x^n \mid \theta_0) + \int f(x^n \mid \theta) f(\theta)\, d\theta} \\
&= \frac{\mathcal{L}(\theta_0)}{\mathcal{L}(\theta_0) + \int \mathcal{L}(\theta) f(\theta)\, d\theta}.
\end{aligned}$$

We saw that, in estimation problems, the prior was not very influential and
that the frequentist and Bayesian methods gave similar answers. This is not
the case in hypothesis testing. Also, one can’t use improper priors in testing
because this leads to an undefined constant in the denominator of the expression
above. Thus, if you use Bayesian testing you must choose the prior $f(\theta)$
very carefully. It is possible to get a prior-free bound on $\mathbb{P}(H_0 \mid X^n = x^n)$.
Notice that $0 \le \int \mathcal{L}(\theta) f(\theta)\, d\theta \le \mathcal{L}(\hat{\theta})$. Hence,

$$\frac{\mathcal{L}(\theta_0)}{\mathcal{L}(\theta_0) + \mathcal{L}(\hat{\theta})} \le \mathbb{P}(H_0 \mid X^n = x^n) \le 1.$$

The upper bound is not very interesting, but the lower bound is non-trivial.
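To make the sensitivity to the prior concrete, here is a sketch for one specific setting that is not taken from the text: $X_1, \ldots, X_n \sim N(\theta, 1)$, $H_0 : \theta = 0$, prior weights 1/2 each, and a $N(0, b^2)$ prior on $\theta$ under $H_1$. In that setting the likelihood depends on the data only through the sample mean, and the integral $\int \mathcal{L}(\theta) f(\theta)\, d\theta$ is proportional to a $N(0, b^2 + 1/n)$ density evaluated at the sample mean. All function names and numbers are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def posterior_prob_null(xbar, n, b):
    """P(H0 | data) for H0: theta = 0 with theta ~ N(0, b^2) under H1
    (assumed model: X_i ~ N(theta, 1), prior P(H0) = P(H1) = 1/2)."""
    like_h0 = normal_pdf(xbar, 0.0, 1.0 / n)             # proportional to L(theta_0)
    marg_h1 = normal_pdf(xbar, 0.0, b ** 2 + 1.0 / n)    # proportional to integral of L * f
    return like_h0 / (like_h0 + marg_h1)

def lower_bound(xbar, n):
    """Prior-free bound L(theta_0) / (L(theta_0) + L(theta_hat)), theta_hat = xbar."""
    like_h0 = normal_pdf(xbar, 0.0, 1.0 / n)
    like_mle = normal_pdf(xbar, xbar, 1.0 / n)
    return like_h0 / (like_h0 + like_mle)

xbar, n = 0.4, 25          # hypothetical data; the Wald statistic is 2
for b in (0.5, 1.0, 10.0):
    print(b, round(posterior_prob_null(xbar, n, b), 3))
print(round(lower_bound(xbar, n), 3))
```

In this sketch the posterior probability of $H_0$ moves substantially as b changes, while never dropping below the prior-free lower bound.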



11.9 Strengths and Weaknesses of Bayesian Inference 



Bayesian inference is appealing when prior information is available since Bayes’
theorem is a natural way to combine prior information with data. Some people
find Bayesian inference psychologically appealing because it allows us to
make probability statements about parameters. In contrast, frequentist inference
provides confidence sets $C_n$ which trap the parameter 95 percent of the
time, but we cannot say that $\mathbb{P}(\theta \in C_n \mid X^n)$ is .95. In the frequentist approach
we can make probability statements about $C_n$, not $\theta$. However, psychological
appeal is not a compelling scientific argument for using one type of inference
over another.

In parametric models, with large samples, Bayesian and frequentist methods
give approximately the same inferences. In general, they need not agree.

Here are three examples that illustrate the strengths and weaknesses of Bayesian
inference. The first example is Example 6.14 revisited. This example shows
the psychological appeal of Bayesian inference. The second and third show
that Bayesian methods can fail.

11.8 Example (Example 6.14 revisited). We begin by reviewing the example.
Let $\theta$ be a fixed, unknown real number and let $X_1, X_2$ be independent random
variables such that $\mathbb{P}(X_i = 1) = \mathbb{P}(X_i = -1) = 1/2$. Now define $Y_i = \theta + X_i$
and suppose that you only observe $Y_1$ and $Y_2$. Let

$$C = \begin{cases} \{Y_1 - 1\} & \text{if } Y_1 = Y_2 \\ \{(Y_1 + Y_2)/2\} & \text{if } Y_1 \neq Y_2. \end{cases}$$

This is a 75 percent confidence set since, no matter what $\theta$ is, $\mathbb{P}_{\theta}(\theta \in C) = 3/4$.
Suppose we observe $Y_1 = 15$ and $Y_2 = 17$. Then our 75 percent confidence
interval is $\{16\}$. However, we are certain, in this case, that $\theta = 16$. So calling
this a 75 percent confidence set bothers many people. Nonetheless, $C$ is a
valid 75 percent confidence set. It will trap the true value 75 percent of the
time.

The Bayesian solution is more satisfying to many. For simplicity, assume
that $\theta$ is an integer. Let $f(\theta)$ be a prior mass function such that $f(\theta) > 0$ for
every integer $\theta$. When $Y = (Y_1, Y_2) = (15, 17)$, the likelihood function is

$$\mathcal{L}(\theta) = \begin{cases} 1/4 & \theta = 16 \\ 0 & \text{otherwise.} \end{cases}$$

Applying Bayes’ theorem we see that

$$\mathbb{P}(\Theta = \theta \mid Y = (15, 17)) = \begin{cases} 1 & \theta = 16 \\ 0 & \text{otherwise.} \end{cases}$$

Hence, $\mathbb{P}(\theta \in C \mid Y = (15, 17)) = 1$. There is nothing wrong with saying that
$\{16\}$ is a 75 percent confidence interval. But it is not a probability statement
about $\theta$. ■
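The 75 percent coverage claim in Example 11.8 can be checked by simulation. The sketch below fixes a hypothetical true value θ = 16 and hypothetical function names, repeatedly generates $(Y_1, Y_2)$, and counts how often the single point in the confidence set equals θ.

```python
import random

def confidence_point(y1, y2):
    """The single point contained in the confidence set C."""
    return y1 - 1 if y1 == y2 else (y1 + y2) / 2

def coverage(theta, trials=100000, seed=0):
    """Fraction of simulated samples in which C contains theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        y1 = theta + rng.choice((-1, 1))
        y2 = theta + rng.choice((-1, 1))
        hits += confidence_point(y1, y2) == theta
    return hits / trials

print(confidence_point(15, 17))   # the set {16} from the example
print(coverage(theta=16))         # long-run coverage, close to 0.75
```

The set misses θ only when both $X_1$ and $X_2$ equal −1, which happens with probability 1/4. Whenever $Y_1 \neq Y_2$ the set is certain to contain θ, which is the tension the example describes.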

11.9 Example. This is a simplified version of the example in Robins and Ritov 
(1997). The data consist of n IID triples 

(Xi,Ri,yi),...,(x,,y,,R^). 

Let 5 be a finite but very large number, like B = 100^^^. Any realistic sample 
size n will be small compared to B. Let 

6 = (6>i, . . . , 9 b ) 

be a vector of unknown parameters such that 0 < < 1 for 1 < j < H. Let 

i = ( 6 , • • • ,<^ b ) 



be a vector of known numbers such that 



0 < 5 < < 1 - 5 < 1, l<j<B, 

where $\delta$ is some small, positive number. Each data point $(X_i, R_i, Y_i)$ is drawn
in the following way:

1. Draw $X_i$ uniformly from $\{1, \ldots, B\}$.

2. Draw $R_i \sim \text{Bernoulli}(\xi_{X_i})$.

3. If $R_i = 1$, then draw $Y_i \sim \text{Bernoulli}(\theta_{X_i})$. If $R_i = 0$, do not draw $Y_i$.

The model may seem a little artificial but, in fact, it is a caricature of some
real missing data problems in which some data points are not observed. In
this example, $R_i = 0$ can be thought of as meaning “missing.” Our goal is to
estimate

$$\psi = P(Y_i = 1).$$



Note that

$$
\psi = P(Y_i = 1) = \sum_{j=1}^{B} P(Y_i = 1 \mid X = j)\, P(X = j) = \frac{1}{B} \sum_{j=1}^{B} \theta_j \equiv g(\theta)
$$

so $\psi = g(\theta)$ is a function of $\theta$.

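Let us consider a Bayesian analysis first. The likelihood of a single
observation is

$$f(X_i, R_i, Y_i) = f(X_i)\, f(R_i \mid X_i)\, f(Y_i \mid X_i)^{R_i}.$$

The last term is raised to the power $R_i$ since, if $R_i = 0$, then $Y_i$ is not observed
and hence that term drops out of the likelihood. Since $f(X_i) = 1/B$ and
$Y_i$ and $R_i$ are Bernoulli,

$$
f(X_i)\, f(R_i \mid X_i)\, f(Y_i \mid X_i)^{R_i}
= \frac{1}{B}\, \xi_{X_i}^{R_i} (1 - \xi_{X_i})^{1 - R_i}\, \theta_{X_i}^{Y_i R_i} (1 - \theta_{X_i})^{(1 - Y_i) R_i}.
$$

Thus, the likelihood function is

$$
\begin{aligned}
\mathcal{L}(\theta) &= \prod_{i=1}^{n} f(X_i)\, f(R_i \mid X_i)\, f(Y_i \mid X_i)^{R_i} \\
&= \prod_{i=1}^{n} \frac{1}{B}\, \xi_{X_i}^{R_i} (1 - \xi_{X_i})^{1 - R_i}\, \theta_{X_i}^{Y_i R_i} (1 - \theta_{X_i})^{(1 - Y_i) R_i} \\
&\propto \prod_{i=1}^{n} \theta_{X_i}^{Y_i R_i} (1 - \theta_{X_i})^{(1 - Y_i) R_i}.
\end{aligned}
$$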

We have dropped all the terms involving $B$ and the $\xi_j$'s since these are known
constants, not parameters. The log-likelihood is

$$
\begin{aligned}
\ell(\theta) &= \sum_{i=1}^{n} \Bigl[ Y_i R_i \log \theta_{X_i} + (1 - Y_i) R_i \log(1 - \theta_{X_i}) \Bigr] \\
&= \sum_{j=1}^{B} n_j \log \theta_j + \sum_{j=1}^{B} m_j \log(1 - \theta_j)
\end{aligned}
$$

where

$$
\begin{aligned}
n_j &= \#\{ i : Y_i = 1,\ R_i = 1,\ X_i = j \} \\
m_j &= \#\{ i : Y_i = 0,\ R_i = 1,\ X_i = j \}.
\end{aligned}
$$

Now, $n_j = m_j = 0$ for most $j$ since $B$ is so much larger than $n$. This has
several implications. First, the MLE for most $\theta_j$ is not defined. Second, for
most $\theta_j$, the posterior distribution is equal to the prior distribution, since
those $\theta_j$ do not appear in the likelihood. Hence, $f(\theta \mid \text{Data}) \approx f(\theta)$. It follows
that $f(\psi \mid \text{Data}) \approx f(\psi)$. In other words, the data provide little information
about $\psi$ in a Bayesian analysis.

Now we consider a frequentist solution. Define

$$\hat\psi = \frac{1}{n} \sum_{i=1}^{n} \frac{R_i Y_i}{\xi_{X_i}}. \qquad (11.10)$$

This estimator is unbiased and has small mean-squared
error. It can be shown (see Exercise 7) that

$$E(\hat\psi) = \psi \quad \text{and} \quad V(\hat\psi) \le \frac{1}{n \delta^2}. \qquad (11.11)$$

Therefore, the MSE is of order $1/n$ which goes to 0 fairly quickly as we collect
more data, no matter how large $B$ is. The estimator defined in (11.10) is called
the Horvitz-Thompson estimator. It cannot be derived from a Bayesian or
likelihood point of view since it involves the terms $\xi_{X_i}$. These terms drop
out of the log-likelihood and hence will not show up in any likelihood-based
method including Bayesian estimators.

The moral of the story is this. Bayesian methods are tied to the likeli- 
hood function. But in high dimensional (and nonparametric) problems, the 
likelihood may not yield accurate inferences. ■ 
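To make the frequentist estimator concrete, here is a small simulation sketch of the sampling scheme and of the estimator in (11.10). Everything specific in it is an assumption made for illustration: a modest `B` stands in for the astronomically large $B$ of the example, and the functions `theta` and `xi` are made-up choices of $\theta_j$ and $\xi_j$ that satisfy the stated bounds.

```python
import math
import random

B = 100_000    # hypothetical stand-in for the "very large" B in the example
DELTA = 0.1    # hypothetical lower bound delta on the xi_j


def theta(j):
    """Hypothetical 'unknown' success probabilities theta_j in [0.1, 0.9]."""
    return 0.5 + 0.4 * math.sin(j)


def xi(j):
    """Hypothetical known observation probabilities xi_j in [DELTA, 1 - DELTA]."""
    return DELTA + (1 - 2 * DELTA) * ((j * 0.6180339887) % 1.0)


def draw_triple(rng):
    """Draw one (X, R, Y) following steps 1-3; Y is None when R = 0."""
    x = rng.randint(1, B)
    r = 1 if rng.random() < xi(x) else 0
    y = (1 if rng.random() < theta(x) else 0) if r == 1 else None
    return x, r, y


def horvitz_thompson(sample):
    """Estimator (11.10): average of R_i * Y_i / xi_{X_i} over all n draws."""
    return sum(y / xi(x) for x, r, y in sample if r == 1) / len(sample)


def true_psi():
    """psi = (1/B) * sum_j theta_j, computable here only because B is modest."""
    return sum(theta(j) for j in range(1, B + 1)) / B


rng = random.Random(0)
sample = [draw_triple(rng) for _ in range(20_000)]
print(horvitz_thompson(sample), true_psi())
```

Note that the estimator uses the known $\xi_{X_i}$ only for the observed indices and never needs an estimate of any individual $\theta_j$, which is why its accuracy does not degrade as $B$ grows.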



11.10 Example. Suppose that $f$ is a probability density function and that

$$f(x) = c\, g(x)$$

where $g(x) > 0$ is a known function and $c$ is unknown. In principle we can
compute $c$ since $\int f(x)\,dx = 1$ implies that $c = 1/\int g(x)\,dx$. But in many cases
we can't do the integral $\int g(x)\,dx$ since $g$ might be a complicated function and
$x$ could be high dimensional. Despite the fact that $c$ is not known, it is often
possible to draw a sample $X_1, \ldots, X_n$ from $f$; see Chapter 24. Can we use the
sample to estimate the normalizing constant $c$? Here is a frequentist solution:
Let $\hat f_n(x)$ be a consistent estimate of the density $f$. Chapter 20 explains how to
construct such an estimate. Choose any point $x$ and note that $c = f(x)/g(x)$.
Hence, $\hat c = \hat f_n(x)/g(x)$ is a consistent estimate of $c$. Now let us try to solve this
problem from a Bayesian approach. Let $\pi(c)$ be a prior such that $\pi(c) > 0$ for
all $c > 0$. The likelihood function is

$$
\mathcal{L}_n(c) = \prod_{i=1}^{n} f(X_i) = \prod_{i=1}^{n} c\, g(X_i) = c^n \prod_{i=1}^{n} g(X_i) \propto c^n.
$$

Hence the posterior is proportional to $c^n \pi(c)$. The posterior does not depend
on $X_1, \ldots, X_n$, so we come to the startling conclusion that, from the Bayesian
point of view, there is no information in the data about $c$. Moreover, the
posterior mean is

$$
\frac{\int_0^\infty c^{n+1} \pi(c)\, dc}{\int_0^\infty c^{n} \pi(c)\, dc}
$$

which tends to infinity as $n$ increases. ■

These last two examples illustrate an important point. Bayesians are slaves 
to the likelihood function. When the likelihood goes awry, so will Bayesian 
inference. 

What should we conclude from all this? The important thing is to understand
that frequentist and Bayesian methods are answering different questions.
To combine prior beliefs with data in a principled way, use Bayesian
inference. To construct procedures with guaranteed long run performance, such
as confidence intervals, use frequentist methods. Generally, Bayesian methods
run into problems when the parameter space is high dimensional. In particular,
95 percent posterior intervals need not contain the true value 95 percent
of the time (in the frequency sense).



11.10 Bibliographic Remarks 

Some references on Bayesian inference include Carlin and Louis (1996),
Gelman et al. (1995), Lee (1997), Robert (1994), and Schervish (1995). See Cox
(1993), Diaconis and Freedman (1986), Freedman (1999), Barron et al. (1999),
Ghosal et al. (2000), Shen and Wasserman (2001), and Zhao (2000) for
discussions of some of the technicalities of nonparametric Bayesian inference. The
Robins-Ritov example is discussed in detail in Robins and Ritov (1997) where
it is cast more properly as a nonparametric problem. Example 11.10 is due to
Edward George (personal communication). See Berger and Delampady (1987)
and Kass and Raftery (1995) for a discussion of Bayesian testing. See Kass
and Wasserman (1996) for a discussion of noninformative priors.



11.11 Appendix 



Proof of Theorem 11.5. 

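It can be shown that the effect of the prior diminishes as $n$ increases so
that $f(\theta \mid X^n) \propto \mathcal{L}_n(\theta) f(\theta) \approx \mathcal{L}_n(\theta)$. Hence, $\log f(\theta \mid X^n) \approx \ell(\theta)$. Now,
$\ell(\theta) \approx \ell(\hat\theta) + (\theta - \hat\theta)\ell'(\hat\theta) + [(\theta - \hat\theta)^2/2]\,\ell''(\hat\theta) = \ell(\hat\theta) + [(\theta - \hat\theta)^2/2]\,\ell''(\hat\theta)$
since $\ell'(\hat\theta) = 0$. Exponentiating, we get approximately that

$$
f(\theta \mid X^n) \propto \exp\left\{ -\frac{1}{2} \frac{(\theta - \hat\theta)^2}{\sigma_n^2} \right\}
$$

where $\sigma_n^2 = -1/\ell''(\hat\theta_n)$. So the posterior of $\theta$ is approximately Normal with
mean $\hat\theta$ and variance $\sigma_n^2$. Let $\ell_i = \log f(X_i \mid \theta)$, then

$$
\begin{aligned}
\frac{1}{\sigma_n^2} &= -\ell''(\hat\theta_n) = \sum_i -\ell_i''(\hat\theta_n) \\
&= n \left( \frac{1}{n} \right) \sum_i -\ell_i''(\hat\theta_n) \approx n\, E_\theta\left[ -\ell_i''(\hat\theta_n) \right] \\
&= n I(\hat\theta_n)
\end{aligned}
$$

and hence $\sigma_n \approx \text{se}(\hat\theta)$. ■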



11.12 Exercises 

1. Verify (11.7). 

2. Let $X_1, \ldots, X_n \sim \text{Normal}(\mu, 1)$.

(a) Simulate a data set (using $\mu = 5$) consisting of $n = 100$ observations.

(b) Take $f(\mu) = 1$ and find the posterior density. Plot the density.

(c) Simulate 1,000 draws from the posterior. Plot a histogram of the
simulated values and compare the histogram to the answer in (b).

(d) Let $\theta = e^{\mu}$. Find the posterior density for $\theta$ analytically and by
simulation.

(e) Find a 95 percent posterior interval for $\mu$.

(f) Find a 95 percent confidence interval for $\theta$.

3. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$. Let $f(\theta) \propto 1/\theta$. Find the posterior
density.

4. Suppose that 50 people are given a placebo and 50 are given a new
treatment. 30 placebo patients show improvement while 40 treated
patients show improvement. Let $\tau = p_2 - p_1$ where $p_2$ is the probability of
improving under treatment and $p_1$ is the probability of improving under
placebo.

(a) Find the MLE of $\tau$. Find the standard error and 90 percent confidence
interval using the delta method.

(b) Find the standard error and 90 percent confidence interval using the
parametric bootstrap.

(c) Use the prior $f(p_1, p_2) = 1$. Use simulation to find the posterior
mean and posterior 90 percent interval for $\tau$.

(d) Let

$$\psi = \log\left( \left( \frac{p_1}{1 - p_1} \right) \div \left( \frac{p_2}{1 - p_2} \right) \right)$$

be the log-odds ratio. Note that $\psi = 0$ if $p_1 = p_2$. Find the MLE of $\psi$.
Use the delta method to find a 90 percent confidence interval for $\psi$.

(e) Use simulation to find the posterior mean and posterior 90 percent
interval for $\psi$.

5. Consider the Bernoulli(p) observations 

0101000000 

Plot the posterior for $p$ using these priors: Beta(1/2, 1/2), Beta(1, 1),
Beta(10, 10), Beta(100, 100).

6. Let $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$.

(a) Let $\lambda \sim \text{Gamma}(\alpha, \beta)$ be the prior. Show that the posterior is also
a Gamma. Find the posterior mean.

(b) Find the Jeffreys' prior. Find the posterior.

7. In Example 11.9, verify (11.11). 

8. Let $X \sim N(\mu, 1)$. Consider testing

$$H_0 : \mu = 0 \quad \text{versus} \quad H_1 : \mu \neq 0.$$

Take $P(H_0) = P(H_1) = 1/2$. Let the prior for $\mu$ under $H_1$ be
$\mu \sim N(0, b^2)$. Find an expression for $P(H_0 \mid X = x)$. Compare $P(H_0 \mid X = x)$
to the p-value of the Wald test. Do the comparison numerically for a
variety of values of $x$ and $b$. Now repeat the problem using a sample of
size $n$. You will see that the posterior probability of $H_0$ can be large even
when the p-value is small, especially when $n$ is large. This disagreement
between Bayesian and frequentist testing is called the Jeffreys-Lindley
paradox.



Statistical Decision Theory



12.1 Preliminaries 

We have considered several point estimators such as the maximum likelihood 
estimator, the method of moments estimator, and the posterior mean. In fact, 
there are many other ways to generate estimators. How do we choose among 
them? The answer is found in decision theory which is a formal theory for 
comparing statistical procedures. 

Consider a parameter $\theta$ which lives in a parameter space $\Theta$. Let $\hat\theta$ be an
estimator of $\theta$. In the language of decision theory, an estimator is sometimes
called a decision rule and the possible values of the decision rule are called
actions.

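We shall measure the discrepancy between $\theta$ and $\hat\theta$ using a loss function
$L(\theta, \hat\theta)$. Formally, $L$ maps $\Theta \times \Theta$ into $\mathbb{R}$. Here are some examples of loss
functions:

$$
\begin{array}{ll}
L(\theta, \hat\theta) = (\theta - \hat\theta)^2 & \text{squared error loss,} \\
L(\theta, \hat\theta) = |\theta - \hat\theta| & \text{absolute error loss,} \\
L(\theta, \hat\theta) = |\theta - \hat\theta|^p & L_p \text{ loss,} \\
L(\theta, \hat\theta) = 0 \text{ if } \theta = \hat\theta \text{ or } 1 \text{ if } \theta \neq \hat\theta & \text{zero-one loss,} \\
L(\theta, \hat\theta) = \int \log\left( \dfrac{f(x; \theta)}{f(x; \hat\theta)} \right) f(x; \theta)\, dx & \text{Kullback-Leibler loss.}
\end{array}
$$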







Bear in mind in what follows that an estimator is a function of the data.
To emphasize this point, sometimes we will write $\hat\theta$ as $\hat\theta(X)$. To assess an
estimator, we evaluate the average loss or risk.



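12.1 Definition. The risk of an estimator $\hat\theta$ is

$$
R(\theta, \hat\theta) = E_\theta\bigl( L(\theta, \hat\theta) \bigr) = \int L(\theta, \hat\theta(x))\, f(x; \theta)\, dx.
$$

When the loss function is squared error, the risk is just the MSE (mean
squared error):

$$
R(\theta, \hat\theta) = E_\theta(\hat\theta - \theta)^2 = \text{MSE} = V_\theta(\hat\theta) + \text{bias}_\theta^2(\hat\theta).
$$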

In the rest of the chapter, if we do not state what loss function we are using, 
assume the loss function is squared error. 



12.2 Comparing Risk Functions 

To compare two estimators we can compare their risk functions. However, this 
does not provide a clear answer as to which estimator is better. Consider the 
following examples. 

12.2 Example. Let $X \sim N(\theta, 1)$ and assume we are using squared error
loss. Consider two estimators: $\hat\theta_1 = X$ and $\hat\theta_2 = 3$. The risk functions are
$R(\theta, \hat\theta_1) = E_\theta(X - \theta)^2 = 1$ and $R(\theta, \hat\theta_2) = E_\theta(3 - \theta)^2 = (3 - \theta)^2$. If $2 < \theta < 4$
then $R(\theta, \hat\theta_2) < R(\theta, \hat\theta_1)$, otherwise, $R(\theta, \hat\theta_1) \le R(\theta, \hat\theta_2)$. Neither estimator
uniformly dominates the other; see Figure 12.1. ■

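FIGURE 12.1. Comparing two risk functions. Neither risk function dominates the
other at all values of $\theta$.

12.3 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. Consider squared error loss
and let $\hat p_1 = \overline{X}$. Since this has 0 bias, we have that

$$R(p, \hat p_1) = V(\overline{X}) = \frac{p(1 - p)}{n}.$$

Another estimator is

$$\hat p_2 = \frac{Y + \alpha}{\alpha + \beta + n}$$

where $Y = \sum_{i=1}^{n} X_i$ and $\alpha$ and $\beta$ are positive constants. This is the posterior
mean using a Beta$(\alpha, \beta)$ prior. Now,

$$
\begin{aligned}
R(p, \hat p_2) &= V_p(\hat p_2) + \bigl( \text{bias}_p(\hat p_2) \bigr)^2 \\
&= V_p\left( \frac{Y + \alpha}{\alpha + \beta + n} \right) + \left( E_p\left( \frac{Y + \alpha}{\alpha + \beta + n} \right) - p \right)^2 \\
&= \frac{n p (1 - p)}{(\alpha + \beta + n)^2} + \left( \frac{n p + \alpha}{\alpha + \beta + n} - p \right)^2.
\end{aligned}
$$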

Let $\alpha = \beta = \sqrt{n/4}$. (In Example 12.12 we will explain this choice.) The
resulting estimator is

$$\hat p_2 = \frac{Y + \sqrt{n/4}}{n + \sqrt{n}}$$

and the risk function is

$$R(p, \hat p_2) = \frac{n}{4 (n + \sqrt{n})^2}.$$

The risk functions are plotted in Figure 12.2. As we can see, neither estimator
uniformly dominates the other.



These examples highlight the need to be able to compare risk functions. 
To do so, we need a one-number summary of the risk function. Two such 
summaries are the maximum risk and the Bayes risk. 




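12.4 Definition. The maximum risk is

$$\overline{R}(\hat\theta) = \sup_\theta R(\theta, \hat\theta) \qquad (12.1)$$

and the Bayes risk is

$$r(f, \hat\theta) = \int R(\theta, \hat\theta)\, f(\theta)\, d\theta \qquad (12.2)$$

where $f(\theta)$ is a prior for $\theta$.

FIGURE 12.2. Risk functions for $\hat p_1$ and $\hat p_2$ in Example 12.3. The solid curve is
$R(\hat p_1)$. The dotted line is $R(\hat p_2)$.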



12.5 Example. Consider again the two estimators in Example 12.3. We have

$$\overline{R}(\hat p_1) = \max_{0 \le p \le 1} \frac{p(1 - p)}{n} = \frac{1}{4n}$$

and

$$\overline{R}(\hat p_2) = \max_{p} \frac{n}{4(n + \sqrt{n})^2} = \frac{n}{4(n + \sqrt{n})^2}.$$

Based on maximum risk, $\hat p_2$ is a better estimator since $\overline{R}(\hat p_2) < \overline{R}(\hat p_1)$. However,
when $n$ is large, $R(\hat p_1)$ has smaller risk except for a small region in the
parameter space near $p = 1/2$. Thus, many people prefer $\hat p_1$ to $\hat p_2$. This
illustrates that one-number summaries like maximum risk are imperfect. Now
consider the Bayes risk. For illustration, let us take $f(p) = 1$. Then

$$r(f, \hat p_1) = \int R(p, \hat p_1)\, dp = \int \frac{p(1 - p)}{n}\, dp = \frac{1}{6n}$$

and

$$r(f, \hat p_2) = \int R(p, \hat p_2)\, dp = \frac{n}{4(n + \sqrt{n})^2}.$$

For $n \ge 20$, $r(f, \hat p_2) > r(f, \hat p_1)$ which suggests that $\hat p_1$ is a better estimator.
This might seem intuitively reasonable but this answer depends on the choice
of prior. The advantage of using maximum risk, despite its problems, is that
it does not require one to choose a prior. ■
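The two summaries in this example are easy to evaluate numerically. The sketch below is illustrative only (function names, grid size, and number of integration steps are assumptions). It computes the maximum risk of the sample mean on a grid, its Bayes risk under the flat prior by the midpoint rule, and the constant risk of the second estimator.

```python
import math


def risk_p1(p, n):
    """R(p, p1_hat) = p(1 - p)/n for the sample mean."""
    return p * (1 - p) / n


def risk_p2(n):
    """Constant risk n / (4 (n + sqrt(n))^2) of p2_hat with alpha = beta = sqrt(n/4)."""
    return n / (4 * (n + math.sqrt(n)) ** 2)


def max_risk_p1(n, grid=10_001):
    """Maximum of R(p, p1_hat) over a grid on [0, 1]; analytically 1/(4n)."""
    return max(risk_p1(i / (grid - 1), n) for i in range(grid))


def bayes_risk_p1(n, steps=100_000):
    """Midpoint-rule integral of R(p, p1_hat) under f(p) = 1; analytically 1/(6n)."""
    h = 1.0 / steps
    return sum(risk_p1((i + 0.5) * h, n) for i in range(steps)) * h


for n in (10, 19, 20, 100):
    print(n, max_risk_p1(n), risk_p2(n), bayes_risk_p1(n))
```

Because the risk of the second estimator is constant in $p$, its maximum risk and its Bayes risk (under any prior) coincide, so a single function covers both summaries.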



These two summaries of the risk function suggest two different methods
for devising estimators: choosing $\hat\theta$ to minimize the maximum risk leads to
minimax estimators; choosing $\hat\theta$ to minimize the Bayes risk leads to Bayes
estimators.

12.6 Definition. A decision rule that minimizes the Bayes risk is called a
Bayes rule. Formally, $\hat\theta$ is a Bayes rule with respect to the prior $f$ if

$$r(f, \hat\theta) = \inf_{\tilde\theta} r(f, \tilde\theta) \qquad (12.3)$$

where the infimum is over all estimators $\tilde\theta$. An estimator that minimizes
the maximum risk is called a minimax rule. Formally, $\hat\theta$ is minimax if

$$\sup_\theta R(\theta, \hat\theta) = \inf_{\tilde\theta} \sup_\theta R(\theta, \tilde\theta) \qquad (12.4)$$

where the infimum is over all estimators $\tilde\theta$.



12.3 Bayes Estimators 



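Let $f$ be a prior. From Bayes' theorem, the posterior density is

$$
f(\theta \mid x) = \frac{f(x \mid \theta) f(\theta)}{m(x)} = \frac{f(x \mid \theta) f(\theta)}{\int f(x \mid \theta) f(\theta)\, d\theta} \qquad (12.5)
$$

where $m(x) = \int f(x, \theta)\, d\theta = \int f(x \mid \theta) f(\theta)\, d\theta$ is the marginal distribution of
$X$. Define the posterior risk of an estimator $\hat\theta(x)$ by

$$r(\hat\theta \mid x) = \int L(\theta, \hat\theta(x))\, f(\theta \mid x)\, d\theta. \qquad (12.6)$$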

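12.7 Theorem. The Bayes risk $r(f, \hat\theta)$ satisfies

$$r(f, \hat\theta) = \int r(\hat\theta \mid x)\, m(x)\, dx.$$

Let $\hat\theta(x)$ be the value of $\theta$ that minimizes $r(\hat\theta \mid x)$. Then $\hat\theta$ is the Bayes estimator.

Proof. We can rewrite the Bayes risk as follows:

$$
\begin{aligned}
r(f, \hat\theta) &= \int R(\theta, \hat\theta) f(\theta)\, d\theta = \int \left( \int L(\theta, \hat\theta(x)) f(x \mid \theta)\, dx \right) f(\theta)\, d\theta \\
&= \int\!\!\int L(\theta, \hat\theta(x)) f(x, \theta)\, dx\, d\theta = \int\!\!\int L(\theta, \hat\theta(x)) f(\theta \mid x)\, m(x)\, dx\, d\theta \\
&= \int \left( \int L(\theta, \hat\theta(x)) f(\theta \mid x)\, d\theta \right) m(x)\, dx = \int r(\hat\theta \mid x)\, m(x)\, dx.
\end{aligned}
$$

If we choose $\hat\theta(x)$ to be the value of $\theta$ that minimizes $r(\hat\theta \mid x)$ then we will
minimize the integrand at every $x$ and thus minimize the integral $\int r(\hat\theta \mid x)\, m(x)\, dx$. ■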



Now we can find an explicit formula for the Bayes estimator for some specific 
loss functions. 



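12.8 Theorem. If $L(\theta, \hat\theta) = (\theta - \hat\theta)^2$ then the Bayes estimator is

$$\hat\theta(x) = \int \theta f(\theta \mid x)\, d\theta = E(\theta \mid X = x). \qquad (12.7)$$

If $L(\theta, \hat\theta) = |\theta - \hat\theta|$ then the Bayes estimator is the median of the posterior
$f(\theta \mid x)$. If $L(\theta, \hat\theta)$ is zero-one loss, then the Bayes estimator is the mode of the
posterior $f(\theta \mid x)$.

Proof. We will prove the theorem for squared error loss. The Bayes rule
$\hat\theta(x)$ minimizes $r(\hat\theta \mid x) = \int (\theta - \hat\theta(x))^2 f(\theta \mid x)\, d\theta$. Taking the derivative of $r(\hat\theta \mid x)$
with respect to $\hat\theta(x)$ and setting it equal to 0 yields the equation
$2 \int (\theta - \hat\theta(x)) f(\theta \mid x)\, d\theta = 0$. Solving for $\hat\theta(x)$ we get (12.7). ■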



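12.9 Example. Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ where $\sigma^2$ is known. Suppose we
use a $N(a, b^2)$ prior for $\mu$. The Bayes estimator with respect to squared error
loss is the posterior mean, which is

$$
\hat\theta(X_1, \ldots, X_n) = \frac{b^2}{b^2 + \frac{\sigma^2}{n}}\, \overline{X} + \frac{\frac{\sigma^2}{n}}{b^2 + \frac{\sigma^2}{n}}\, a. \qquad ■
$$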
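The posterior mean in this example is a weighted average of the sample mean and the prior mean. As a small illustrative sketch (the function name and argument names are assumptions, not notation from the text), it can be computed as follows.

```python
def normal_posterior_mean(x_bar, n, sigma2, a, b2):
    """Posterior mean of mu for a N(mu, sigma2) sample with a N(a, b2) prior.

    x_bar  : sample mean of the n observations
    sigma2 : known data variance sigma^2
    a, b2  : prior mean and prior variance b^2
    """
    weight = b2 / (b2 + sigma2 / n)      # weight placed on the sample mean
    return weight * x_bar + (1.0 - weight) * a


# With sigma^2/n equal to b^2 the estimate sits halfway between x_bar and a.
print(normal_posterior_mean(x_bar=2.0, n=4, sigma2=1.0, a=0.0, b2=0.25))
```

As $n$ grows, or as the prior variance $b^2$ grows, the weight on $\overline{X}$ tends to 1 and the Bayes estimator approaches the sample mean.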



12.4 Minimax Rules 



Finding minimax rules is complicated and we cannot attempt a complete 
coverage of that theory here but we will mention a few key results. The main 
message to take away from this section is: Bayes estimators with a constant 
risk function are minimax. 

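12.10 Theorem. Let $\hat\theta^f$ be the Bayes rule for some prior $f$:

$$r(f, \hat\theta^f) = \inf_{\hat\theta} r(f, \hat\theta). \qquad (12.8)$$

Suppose that

$$R(\theta, \hat\theta^f) \le r(f, \hat\theta^f) \quad \text{for all } \theta. \qquad (12.9)$$

Then $\hat\theta^f$ is minimax and $f$ is called a least favorable prior.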







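Proof. Suppose that $\hat\theta^f$ is not minimax. Then there is another rule $\hat\theta_0$ such
that $\sup_\theta R(\theta, \hat\theta_0) < \sup_\theta R(\theta, \hat\theta^f)$. Since the average of a function is always
less than or equal to its maximum, we have that $r(f, \hat\theta_0) \le \sup_\theta R(\theta, \hat\theta_0)$.
Hence,

$$r(f, \hat\theta_0) \le \sup_\theta R(\theta, \hat\theta_0) < \sup_\theta R(\theta, \hat\theta^f) \le r(f, \hat\theta^f)$$

which contradicts (12.8). ■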



12.11 Theorem. Suppose that $\hat\theta$ is the Bayes rule with respect to some
prior $f$. Suppose further that $\hat\theta$ has constant risk: $R(\theta, \hat\theta) = c$ for some $c$.
Then $\hat\theta$ is minimax.

Proof. The Bayes risk is $r(f, \hat\theta) = \int R(\theta, \hat\theta) f(\theta)\, d\theta = c$ and hence $R(\theta, \hat\theta) \le r(f, \hat\theta)$
for all $\theta$. Now apply the previous theorem. ■



12.12 Example. Consider the Bernoulli model with squared error loss. In
Example 12.3 we showed that the estimator

$$\hat p(X^n) = \frac{\sum_{i=1}^{n} X_i + \sqrt{n/4}}{n + \sqrt{n}}$$

has a constant risk function. This estimator is the posterior mean, and hence
the Bayes rule, for the prior Beta$(\alpha, \beta)$ with $\alpha = \beta = \sqrt{n/4}$. Hence, by the
previous theorem, this estimator is minimax. ■
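The constant-risk property can be verified exactly by summing over the Binomial distribution of $Y = \sum_i X_i$. The following is a minimal sketch with illustrative names and an arbitrary choice of $n = 25$. It evaluates the squared-error risk of $(Y + \alpha)/(\alpha + \beta + n)$ at several values of $p$ and compares it with $n / (4(n + \sqrt{n})^2)$.

```python
import math


def exact_risk(p, n, alpha, beta):
    """Squared-error risk of (Y + alpha)/(alpha + beta + n) with Y ~ Binomial(n, p)."""
    total = 0.0
    for y in range(n + 1):
        pmf = math.comb(n, y) * p**y * (1 - p) ** (n - y)
        estimate = (y + alpha) / (alpha + beta + n)
        total += pmf * (estimate - p) ** 2
    return total


n = 25
a = math.sqrt(n / 4)          # alpha = beta = sqrt(n/4)
for p in (0.1, 0.3, 0.5, 0.9):
    print(p, exact_risk(p, n, a, a), n / (4 * (n + math.sqrt(n)) ** 2))
```

With any other choice such as $\alpha = \beta = 1$, the same function returns a risk that varies with $p$, so the flat risk curve is special to $\alpha = \beta = \sqrt{n/4}$.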



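12.13 Example. Consider again the Bernoulli model but with loss function

$$L(p, \hat p) = \frac{(p - \hat p)^2}{p(1 - p)}.$$

Let

$$\hat p(X^n) = \hat p = \frac{\sum_{i=1}^{n} X_i}{n}.$$

The risk is

$$
R(p, \hat p) = E\left( \frac{(\hat p - p)^2}{p(1 - p)} \right) = \frac{1}{p(1 - p)} \left( \frac{p(1 - p)}{n} \right) = \frac{1}{n}
$$

which, as a function of $p$, is constant. It can be shown that, for this loss
function, $\hat p(X^n)$ is the Bayes estimator under the prior $f(p) = 1$. Hence, $\hat p$ is
minimax. ■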



A natural question to ask is: what is the minimax estimator for a Normal
model?

12.14 Theorem. Let $X_1, \ldots, X_n \sim N(\theta, 1)$ and let $\hat\theta = \overline{X}$. Then $\hat\theta$ is minimax
with respect to any well-behaved loss function.¹ It is the only estimator with
this property.

FIGURE 12.3. Risk function for constrained Normal with $m = 0.5$. The two short
dashed lines show the least favorable prior which puts its mass at two points.

If the parameter space is restricted, then the theorem above does not apply 
as the next example shows. 

12.15 Example. Suppose that $X \sim N(\theta, 1)$ and that $\theta$ is known to lie in the
interval $[-m, m]$ where $0 < m < 1$. The unique, minimax estimator under
squared error loss is

$$\hat\theta(X) = m \tanh(mX)$$

where $\tanh(z) = (e^z - e^{-z})/(e^z + e^{-z})$. It can be shown that this is the Bayes
rule with respect to the prior that puts mass 1/2 at $m$ and mass 1/2 at $-m$.
Moreover, it can be shown that the risk is not constant but it does satisfy
$R(\theta, \hat\theta) \le r(f, \hat\theta)$ for all $\theta$; see Figure 12.3. Hence, Theorem 12.10 implies that
$\hat\theta$ is minimax. ■



¹“Well-behaved” means that the level sets must be convex and symmetric about the origin.
The result holds up to sets of measure 0.



12.5 Maximum Likelihood, Minimax, and Bayes



For parametric models that satisfy weak regularity conditions, the maximum 
likelihood estimator is approximately minimax. Consider squared error loss 
which is squared bias plus variance. In parametric models with large samples, 
it can be shown that the variance term dominates the bias so the risk of the 
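MLE $\hat\theta$ roughly equals the variance:²

$$R(\theta, \hat\theta) = V_\theta(\hat\theta) + \text{bias}^2 \approx V_\theta(\hat\theta).$$

As we saw in Chapter 9, the variance of the MLE is approximately

$$V(\hat\theta) \approx \frac{1}{n I(\theta)}$$

where $I(\theta)$ is the Fisher information. Hence,

$$n R(\theta, \hat\theta) \approx \frac{1}{I(\theta)}. \qquad (12.10)$$

For any other estimator $\theta'$, it can be shown that for large $n$, $R(\theta, \theta') \ge R(\theta, \hat\theta)$.
More precisely,

$$
\lim_{\epsilon \to 0} \limsup_{n \to \infty} \sup_{|\theta - \theta'| < \epsilon} n R(\theta', \hat\theta) \ge \frac{1}{I(\theta)}. \qquad (12.11)
$$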

This says that, in a local, large sample sense, the MLE is minimax. It can also 
be shown that the MLE is approximately the Bayes rule. 

In summary: 



In most parametric models, with large samples, the MLE is approxi- 
mately minimax and Bayes. 



There is a caveat: these results break down when the number of parameters 
is large as the next example shows. 

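12.16 Example (Many Normal means). Let $Y_i \sim N(\theta_i, \sigma^2/n)$, $i = 1, \ldots, n$.
Let $Y = (Y_1, \ldots, Y_n)$ denote the data and let $\theta = (\theta_1, \ldots, \theta_n)$ denote the
unknown parameters. Assume that

$$\theta \in \Theta_n \equiv \left\{ (\theta_1, \ldots, \theta_n) : \sum_{i=1}^{n} \theta_i^2 \le c^2 \right\}$$



²Typically, the squared bias is of order $O(n^{-2})$ while the variance is of order $O(n^{-1})$.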







for some $c > 0$. In this model, there are as many parameters as observations.³
The MLE is $\hat\theta = Y = (Y_1, \ldots, Y_n)$. Under the loss function
$L(\theta, \hat\theta) = \sum_{i=1}^{n} (\hat\theta_i - \theta_i)^2$, the risk of the MLE is $R(\theta, \hat\theta) = \sigma^2$. It can be shown that the minimax risk
is approximately $\sigma^2 c^2/(\sigma^2 + c^2)$ and one can find an estimator $\tilde\theta$ that achieves
this risk. Since $\sigma^2 c^2/(\sigma^2 + c^2) < \sigma^2$, we see that $\tilde\theta$ has smaller risk than the MLE.
In practice, the difference between the risks can be substantial. This shows
that maximum likelihood is not an optimal estimator in high dimensional
problems. ■



12.6 Admissibility 



Minimax estimators and Bayes estimators are “good estimators” in the sense 
that they have small risk. It is also useful to characterize bad estimators. 



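12.17 Definition. An estimator $\hat\theta$ is inadmissible if there exists another
rule $\hat\theta'$ such that

$$
\begin{aligned}
R(\theta, \hat\theta') &\le R(\theta, \hat\theta) \quad \text{for all } \theta \text{ and} \\
R(\theta, \hat\theta') &< R(\theta, \hat\theta) \quad \text{for at least one } \theta.
\end{aligned}
$$

Otherwise, $\hat\theta$ is admissible.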







12.18 Example. Let $X \sim N(\theta, 1)$ and consider estimating $\theta$ with squared
error loss. Let $\hat\theta(X) = 3$. We will show that $\hat\theta$ is admissible. Suppose not.
Then there exists a different rule $\hat\theta'$ with smaller risk. In particular, $R(3, \hat\theta') \le R(3, \hat\theta) = 0$.
Hence, $0 = R(3, \hat\theta') = \int (\hat\theta'(x) - 3)^2 f(x; 3)\, dx$. Thus, $\hat\theta'(x) = 3$.
So there is no rule that beats $\hat\theta$. Even though $\hat\theta$ is admissible it is clearly a
bad decision rule. ■

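12.19 Theorem (Bayes Rules Are Admissible). Suppose that $\Theta \subset \mathbb{R}$ and that
$R(\theta, \hat\theta)$ is a continuous function of $\theta$ for every $\hat\theta$. Let $f$ be a prior density with
full support, meaning that, for every $\theta$ and every $\epsilon > 0$, $\int_{\theta - \epsilon}^{\theta + \epsilon} f(\theta)\, d\theta > 0$. Let
$\hat\theta^f$ be the Bayes rule. If the Bayes risk is finite then $\hat\theta^f$ is admissible.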

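³The many Normal means problem is more general than it looks. Many nonparametric
estimation problems are mathematically equivalent to this model.

Proof. Suppose $\hat\theta^f$ is inadmissible. Then there exists a better rule $\hat\theta$ such
that $R(\theta, \hat\theta) \le R(\theta, \hat\theta^f)$ for all $\theta$ and $R(\theta_0, \hat\theta) < R(\theta_0, \hat\theta^f)$ for some $\theta_0$. Let
$\nu = R(\theta_0, \hat\theta^f) - R(\theta_0, \hat\theta) > 0$. Since $R$ is continuous, there is an $\epsilon > 0$ such
that $R(\theta, \hat\theta^f) - R(\theta, \hat\theta) > \nu/2$ for all $\theta \in (\theta_0 - \epsilon, \theta_0 + \epsilon)$. Now,

$$
\begin{aligned}
r(f, \hat\theta^f) - r(f, \hat\theta) &= \int R(\theta, \hat\theta^f) f(\theta)\, d\theta - \int R(\theta, \hat\theta) f(\theta)\, d\theta \\
&= \int \left[ R(\theta, \hat\theta^f) - R(\theta, \hat\theta) \right] f(\theta)\, d\theta \\
&\ge \int_{\theta_0 - \epsilon}^{\theta_0 + \epsilon} \left[ R(\theta, \hat\theta^f) - R(\theta, \hat\theta) \right] f(\theta)\, d\theta \\
&\ge \frac{\nu}{2} \int_{\theta_0 - \epsilon}^{\theta_0 + \epsilon} f(\theta)\, d\theta \\
&> 0.
\end{aligned}
$$

Hence, $r(f, \hat\theta^f) > r(f, \hat\theta)$. This implies that $\hat\theta^f$ does not minimize $r(f, \hat\theta)$ which
contradicts the fact that $\hat\theta^f$ is the Bayes rule. ■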



12.20 Theorem. Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$. Under squared error loss, $\overline{X}$ is
admissible.

The proof of the last theorem is quite technical and is omitted but the idea
is as follows: The posterior mean is admissible for any strictly positive prior.
Take the prior to be $N(a, b^2)$. When $b^2$ is very large, the posterior mean is
approximately equal to $\overline{X}$.

How are minimaxity and admissibility linked? In general, a rule may be one, 
both, or neither. But here are some facts linking admissibility and minimaxity. 

12.21 Theorem. Suppose that $\hat\theta$ has constant risk and is admissible. Then it
is minimax.

Proof. The risk is $R(\theta, \hat\theta) = c$ for some $c$. If $\hat\theta$ were not minimax then
there exists a rule $\hat\theta'$ such that

$$R(\theta, \hat\theta') \le \sup_\theta R(\theta, \hat\theta') < \sup_\theta R(\theta, \hat\theta) = c.$$

This would imply that $\hat\theta$ is inadmissible. ■

Now we can prove a restricted version of Theorem 12.14 for squared error 
loss. 

12.22 Theorem. Let $X_1, \ldots, X_n \sim N(\theta, 1)$. Then, under squared error loss,
$\hat\theta = \overline{X}$ is minimax.

Proof. According to Theorem 12.20, $\hat\theta$ is admissible. The risk of $\hat\theta$ is $1/n$
which is constant. The result follows from Theorem 12.21. ■

Although minimax rules are not guaranteed to be admissible they are “close
to admissible.” Say that $\hat\theta$ is strongly inadmissible if there exists a rule $\hat\theta'$
and an $\epsilon > 0$ such that $R(\theta, \hat\theta') < R(\theta, \hat\theta) - \epsilon$ for all $\theta$.

12.23 Theorem. If $\hat\theta$ is minimax, then it is not strongly inadmissible.



12.7 Stein’s Paradox 

Suppose that $X \sim N(\theta, 1)$ and consider estimating $\theta$ with squared error loss.
From the previous section we know that $\hat\theta(X) = X$ is admissible. Now consider
estimating two unrelated quantities $\theta = (\theta_1, \theta_2)$ and suppose that $X_1 \sim N(\theta_1, 1)$
and $X_2 \sim N(\theta_2, 1)$ independently, with loss $L(\theta, \hat\theta) = \sum_{j=1}^{2} (\theta_j - \hat\theta_j)^2$.
Not surprisingly, $\hat\theta(X) = X$ is again admissible where $X = (X_1, X_2)$. Now
consider the generalization to $k$ normal means. Let $\theta = (\theta_1, \ldots, \theta_k)$,
$X = (X_1, \ldots, X_k)$ with $X_i \sim N(\theta_i, 1)$ (independent) and loss
$L(\theta, \hat\theta) = \sum_{j=1}^{k} (\theta_j - \hat\theta_j)^2$. Stein astounded everyone when he proved that, if $k \ge 3$, then $\hat\theta(X) = X$
is inadmissible. It can be shown that the James-Stein estimator $\hat\theta^S$ has
smaller risk, where $\hat\theta^S = (\hat\theta_1^S, \ldots, \hat\theta_k^S)$,

$$\hat\theta_i^S(X) = \left( 1 - \frac{k - 2}{\sum_j X_j^2} \right)^{+} X_i \qquad (12.12)$$

and $(z)^+ = \max\{z, 0\}$. This estimator shrinks the $X_i$'s towards 0. The message
is that, when estimating many parameters, there is great value in shrinking the
estimates. This observation plays an important role in modern nonparametric
function estimation.
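The risk comparison can be explored by Monte Carlo. The sketch below is one possible setup rather than a prescribed one: the function names, the seed, the number of simulations, and the choice $\theta = 0$ with $k = 10$ are all illustrative assumptions. It applies the positive-part James-Stein rule (12.12) and averages the summed squared error of both estimators.

```python
import random


def james_stein(x):
    """Positive-part James-Stein estimate (12.12) for a list x of k >= 3 observations."""
    k = len(x)
    shrink = max(0.0, 1.0 - (k - 2) / sum(v * v for v in x))
    return [shrink * v for v in x]


def squared_error(theta, estimate):
    """Loss L(theta, theta_hat) = sum_j (theta_j - theta_hat_j)^2."""
    return sum((t - e) ** 2 for t, e in zip(theta, estimate))


def simulate_risks(theta, n_sims=20_000, seed=0):
    """Monte Carlo risks of theta_hat = X and of James-Stein at a fixed theta."""
    rng = random.Random(seed)
    loss_mle = loss_js = 0.0
    for _ in range(n_sims):
        x = [rng.gauss(t, 1.0) for t in theta]   # X_i ~ N(theta_i, 1), independent
        loss_mle += squared_error(theta, x)
        loss_js += squared_error(theta, james_stein(x))
    return loss_mle / n_sims, loss_js / n_sims


print(simulate_risks([0.0] * 10))   # risk of X is k = 10; James-Stein is much smaller here
```

The gain is largest when $\theta$ is near the shrinkage target 0 and shrinks as $\|\theta\|$ grows, so it is worth repeating the run with vectors far from the origin.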



12.8 Bibliographic Remarks 

Aspects of decision theory can be found in Casella and Berger (2002), Berger 
(1985), Ferguson (1967), and Lehmann and Casella (1998). 



12.9 Exercises 

1. In each of the following models, find the Bayes risk and the Bayes
estimator, using squared error loss.

(a) $X \sim \text{Binomial}(n, p)$, $p \sim \text{Beta}(\alpha, \beta)$.

(b) $X \sim \text{Poisson}(\lambda)$, $\lambda \sim \text{Gamma}(\alpha, \beta)$.

(c) $X \sim N(\theta, \sigma^2)$ where $\sigma^2$ is known and $\theta \sim N(a, b^2)$.

2. Let $X_1, \ldots, X_n \sim N(\theta, \sigma^2)$ and suppose we estimate $\theta$ with loss function
$L(\theta, \hat\theta) = (\theta - \hat\theta)^2/\sigma^2$. Show that $\overline{X}$ is admissible and minimax.

3. Let $\Theta = \{\theta_1, \ldots, \theta_k\}$ be a finite parameter space. Prove that the
posterior mode is the Bayes estimator under zero-one loss.



4. (Casella and Berger (2002).) Let $X_1, \ldots, X_n$ be a sample from a
distribution with variance $\sigma^2$. Consider estimators of the form $bS^2$ where $S^2$
is the sample variance. Let the loss function for estimating $\sigma^2$ be

$$L(\sigma^2, \hat\sigma^2) = \frac{\hat\sigma^2}{\sigma^2} - 1 - \log\left( \frac{\hat\sigma^2}{\sigma^2} \right).$$

Find the optimal value of $b$ that minimizes the risk for all $\sigma^2$.



5. (Berliner (1983).) Let $X \sim \text{Binomial}(n, p)$ and suppose the loss function
is

$$L(p, \hat p) = \left( 1 - \frac{\hat p}{p} \right)^2$$

where $0 < p < 1$. Consider the estimator $\hat p(X) = 0$. This estimator falls
outside the parameter space $(0, 1)$ but we will allow this. Show that
$\hat p(X) = 0$ is the unique, minimax rule.



6. (Computer Experiment.) Compare the risk of the MLE and the James-Stein
estimator (12.12) by simulation. Try various values of $n$ and various
vectors $\theta$. Summarize your results.



Linear and Logistic Regression



Regression is a method for studying the relationship between a response
variable $Y$ and a covariate $X$. The covariate is also called a predictor
variable or a feature.¹ One way to summarize the relationship between $X$
and $Y$ is through the regression function

$$r(x) = E(Y \mid X = x) = \int y\, f(y \mid x)\, dy. \qquad (13.1)$$

Our goal is to estimate the regression function $r(x)$ from data of the form

$$(Y_1, X_1), \ldots, (Y_n, X_n) \sim F_{X,Y}.$$

In this chapter, we take a parametric approach and assume that $r$ is linear.
In Chapters 20 and 21 we discuss nonparametric regression.

13.1 Simple Linear Regression 

The simplest version of regression is when Xi is simple (one-dimensional) and 
r{x) is assumed to be linear: 

$$r(x) = \beta_0 + \beta_1 x.$$



¹The term “regression” is due to Sir Francis Galton (1822-1911) who noticed that tall and
short men tend to have sons with heights closer to the mean. He called this “regression towards
the mean.”

FIGURE 13.1. Data on nearby stars. The horizontal axis is log light intensity ($X$). The solid line is the least squares line.

This model is called the simple linear regression model. We will make
the further simplifying assumption that $V(\epsilon_i \mid X = x) = \sigma^2$ does not depend
on $x$. We can thus write the linear regression model as follows.

13.1 Definition. The Simple Linear Regression Model

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \qquad (13.2)$$

where $E(\epsilon_i \mid X_i) = 0$ and $V(\epsilon_i \mid X_i) = \sigma^2$.

13.2 Example. Figure 13.1 shows a plot of log surface temperature (Y) versus 
log light intensity (X) for some nearby stars. Also on the plot is an estimated 
linear regression line which will be explained shortly. ■ 

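The unknown parameters in the model are the intercept $\beta_0$, the slope
$\beta_1$, and the variance $\sigma^2$. Let $\hat\beta_0$ and $\hat\beta_1$ denote estimates of $\beta_0$ and $\beta_1$. The
fitted line is

$$\hat r(x) = \hat\beta_0 + \hat\beta_1 x. \qquad (13.3)$$

The predicted values or fitted values are $\hat Y_i = \hat r(X_i)$ and the residuals
are defined to be

$$\hat\epsilon_i = Y_i - \hat Y_i = Y_i - \bigl( \hat\beta_0 + \hat\beta_1 X_i \bigr). \qquad (13.4)$$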







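13.4 Theorem. The least squares estimates are given by

$$
\begin{aligned}
\hat\beta_1 &= \frac{\sum_{i=1}^{n} (X_i - \overline{X}_n)(Y_i - \overline{Y}_n)}{\sum_{i=1}^{n} (X_i - \overline{X}_n)^2}, \qquad &(13.5) \\
\hat\beta_0 &= \overline{Y}_n - \hat\beta_1 \overline{X}_n. \qquad &(13.6)
\end{aligned}
$$

An unbiased estimate of $\sigma^2$ is

$$\hat\sigma^2 = \left( \frac{1}{n - 2} \right) \sum_{i=1}^{n} \hat\epsilon_i^2. \qquad (13.7)$$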


13.5 Example. Consider the star data from Example 13.2. The least squares
estimates are $\hat\beta_0 = 3.58$ and $\hat\beta_1 = 0.166$. The fitted line $\hat r(x) = 3.58 + 0.166\,x$
is shown in Figure 13.1. ■

13.6 Example (The 2000 Presidential Election). Figure 13.2 shows the plot of
votes for Buchanan ($Y$) versus votes for Bush ($X$) in Florida. The least squares
estimates (omitting Palm Beach County) and the standard errors are

$$
\begin{aligned}
\hat\beta_0 &= 66.0991 \qquad & \widehat{\text{se}}(\hat\beta_0) &= 17.2926 \\
\hat\beta_1 &= 0.0035 \qquad & \widehat{\text{se}}(\hat\beta_1) &= 0.0002.
\end{aligned}
$$

The fitted line is

Buchanan = 66.0991 + 0.0035 Bush.

(We will see later how the standard errors were computed.) Figure 13.2 also
shows the residuals. The inferences from linear regression are most accurate
when the residuals behave like random normal numbers. Based on the residual
plot, this is not the case in this example. If we repeat the analysis replacing
votes with log(votes) we get

$$
\begin{aligned}
\hat\beta_0 &= -2.3298 \qquad & \widehat{\text{se}}(\hat\beta_0) &= 0.3529 \\
\hat\beta_1 &= 0.7303 \qquad & \widehat{\text{se}}(\hat\beta_1) &= 0.0358.
\end{aligned}
$$

This gives the fit

log(Buchanan) = −2.3298 + 0.7303 log(Bush).

The residuals look much healthier. Later, we shall address the following
question: how do we see if Palm Beach County has a statistically plausible
outcome? ■

FIGURE 13.2. Voting Data for Election 2000. See Example 13.6.

13.2 Least Squares and Maximum Likelihood 

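Suppose we add the assumption that $\epsilon_i \mid X_i \sim N(0, \sigma^2)$, that is,

$$Y_i \mid X_i \sim N(\mu_i, \sigma^2)$$

where $\mu_i = \beta_0 + \beta_1 X_i$. The likelihood function is

$$
\begin{aligned}
\prod_{i=1}^{n} f(X_i, Y_i) &= \prod_{i=1}^{n} f_X(X_i)\, f_{Y|X}(Y_i \mid X_i) \\
&= \prod_{i=1}^{n} f_X(X_i) \times \prod_{i=1}^{n} f_{Y|X}(Y_i \mid X_i) \\
&= \mathcal{L}_1 \times \mathcal{L}_2
\end{aligned}
$$

where $\mathcal{L}_1 = \prod_{i=1}^{n} f_X(X_i)$ and

$$\mathcal{L}_2 = \prod_{i=1}^{n} f_{Y|X}(Y_i \mid X_i). \qquad (13.8)$$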



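The term $\mathcal{L}_1$ does not involve the parameters $\beta_0$ and $\beta_1$. We shall focus on
the second term $\mathcal{L}_2$ which is called the conditional likelihood, given by

$$
\mathcal{L}_2 \equiv \mathcal{L}(\beta_0, \beta_1, \sigma) = \prod_{i=1}^{n} f_{Y|X}(Y_i \mid X_i) \propto \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i} (Y_i - \mu_i)^2 \right\}.
$$

The conditional log-likelihood is

$$
\ell(\beta_0, \beta_1, \sigma) = -n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \Bigl( Y_i - (\beta_0 + \beta_1 X_i) \Bigr)^2. \qquad (13.9)
$$

To find the MLE of $(\beta_0, \beta_1)$ we maximize $\ell(\beta_0, \beta_1, \sigma)$. From (13.9) we see that
maximizing the likelihood is the same as minimizing the RSS
$\sum_{i=1}^{n} \bigl( Y_i - (\beta_0 + \beta_1 X_i) \bigr)^2$. Therefore, we have shown the following: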



13.7 Theorem. Under the assumption of Normality, the least squares estima- 
tor is also the maximum likelihood estimator. 



We can also maximize $\ell(\beta_0, \beta_1, \sigma)$ over $\sigma$, yielding the MLE

$$\hat\sigma^2 = \frac{1}{n} \sum_{i} \hat\epsilon_i^2. \qquad (13.10)$$



This estimator is similar to, but not identical to, the unbiased estimator.
Common practice is to use the unbiased estimator (13.7).

13.3 Properties of the Least Squares Estimators

We now record the standard errors and limiting distribution of the least
squares estimator. In regression problems, we usually focus on the properties
of the estimators conditional on $X^n = (X_1, \ldots, X_n)$. Thus, we state the
means and variances as conditional means and variances.

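13.8 Theorem. Let $\hat\beta^T = (\hat\beta_0, \hat\beta_1)^T$ denote the least squares estimators.
Then,

$$
\begin{aligned}
E(\hat\beta \mid X^n) &= \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} \\
V(\hat\beta \mid X^n) &= \frac{\sigma^2}{n s_X^2} \begin{pmatrix} \frac{1}{n} \sum_{i=1}^{n} X_i^2 & -\overline{X}_n \\ -\overline{X}_n & 1 \end{pmatrix} \qquad (13.11)
\end{aligned}
$$

where $s_X^2 = n^{-1} \sum_{i=1}^{n} (X_i - \overline{X}_n)^2$.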

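The estimated standard errors of $\hat\beta_0$ and $\hat\beta_1$ are obtained by taking the
square roots of the corresponding diagonal terms of $V(\hat\beta \mid X^n)$ and inserting
the estimate $\hat\sigma$ for $\sigma$. Thus,

$$
\begin{aligned}
\widehat{\text{se}}(\hat\beta_0) &= \frac{\hat\sigma}{s_X \sqrt{n}} \sqrt{\frac{\sum_{i=1}^{n} X_i^2}{n}} \qquad &(13.12) \\
\widehat{\text{se}}(\hat\beta_1) &= \frac{\hat\sigma}{s_X \sqrt{n}}. \qquad &(13.13)
\end{aligned}
$$

We should really write these as $\widehat{\text{se}}(\hat\beta_0 \mid X^n)$ and $\widehat{\text{se}}(\hat\beta_1 \mid X^n)$ but we will use the
shorter notation $\widehat{\text{se}}(\hat\beta_0)$ and $\widehat{\text{se}}(\hat\beta_1)$.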

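13.9 Theorem. Under appropriate conditions we have:

1. (Consistency): $\hat\beta_0 \xrightarrow{P} \beta_0$ and $\hat\beta_1 \xrightarrow{P} \beta_1$.

2. (Asymptotic Normality):

$$
\frac{\hat\beta_0 - \beta_0}{\widehat{\mathrm{se}}(\hat\beta_0)} \rightsquigarrow N(0, 1)
\quad\text{and}\quad
\frac{\hat\beta_1 - \beta_1}{\widehat{\mathrm{se}}(\hat\beta_1)} \rightsquigarrow N(0, 1).
$$

3. Approximate $1 - \alpha$ confidence intervals for $\beta_0$ and $\beta_1$ are

$$
\hat\beta_0 \pm z_{\alpha/2}\, \widehat{\mathrm{se}}(\hat\beta_0)
\quad\text{and}\quad
\hat\beta_1 \pm z_{\alpha/2}\, \widehat{\mathrm{se}}(\hat\beta_1). \qquad (13.14)
$$

4. The Wald test¹ for testing $H_0 : \beta_1 = 0$ versus $H_1 : \beta_1 \neq 0$ is: reject $H_0$
if $|W| > z_{\alpha/2}$ where $W = \hat\beta_1 / \widehat{\mathrm{se}}(\hat\beta_1)$.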

13.10 Example. For the election data, on the log scale, a 95 percent confidence
interval is .7303 ± 2(.0358) = (.66, .80). The Wald statistic for testing
$H_0 : \beta_1 = 0$ versus $H_1 : \beta_1 \neq 0$ is |W| = |.7303 - 0|/.0358 = 20.40 with a
p-value of $P(|Z| > 20.40) \approx 0$. This is strong evidence that the true slope
is not 0. ■
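
The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the helper name `wald_summary` is ours, and it uses the exact Normal quantile (about 1.96) rather than the rounded multiplier 2 used in the text, so the interval endpoints agree with the example only after rounding.

```python
from statistics import NormalDist


def wald_summary(estimate, se, alpha=0.05, null_value=0.0):
    """Return (Wald statistic, two-sided p-value, 1 - alpha confidence interval)."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2)
    w = (estimate - null_value) / se
    p_value = 2 * (1 - std_normal.cdf(abs(w)))
    ci = (estimate - z * se, estimate + z * se)
    return w, p_value, ci


# Slope estimate and standard error quoted in Example 13.10.
w, p_value, ci = wald_summary(0.7303, 0.0358)
print(w, p_value, ci)
```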



13.4 Prediction 

Suppose we have estimated a regression model $\hat r(x) = \hat\beta_0 + \hat\beta_1 x$ from data
$(X_1, Y_1), \ldots, (X_n, Y_n)$. We observe the value $X = x_*$ of the covariate for a
new subject and we want to predict their outcome $Y_*$. An estimate of $Y_*$ is

$$
\hat Y_* = \hat\beta_0 + \hat\beta_1 x_*. \qquad (13.15)
$$

Using the formula for the variance of the sum of two random variables,

$$
V(\hat Y_*) = V(\hat\beta_0 + \hat\beta_1 x_*) = V(\hat\beta_0) + x_*^2\, V(\hat\beta_1) + 2 x_*\, \mathrm{Cov}(\hat\beta_0, \hat\beta_1).
$$

Theorem 13.8 gives the formulas for all the terms in this equation. The
estimated standard error $\widehat{\mathrm{se}}(\hat Y_*)$ is the square root of this variance, with $\hat\sigma^2$ in
place of $\sigma^2$. However, the confidence interval for $Y_*$ is not of the usual form
$\hat Y_* \pm z_{\alpha/2}\, \widehat{\mathrm{se}}$. The reason for this is explained in Exercise 10. The correct form
of the confidence interval is given in the following theorem.



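13.11 Theorem (Prediction Interval). Let

$$
\hat\xi_n^2 = \hat\sigma^2 \left( \frac{\sum_{i=1}^n (X_i - x_*)^2}{n \sum_{i=1}^n (X_i - \bar X)^2} + 1 \right). \qquad (13.16)
$$

An approximate $1 - \alpha$ prediction interval for $Y_*$ is

$$
\hat Y_* \pm z_{\alpha/2}\, \hat\xi_n. \qquad (13.17)
$$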



¹ Recall from equation (10.5) that the Wald statistic for testing $H_0 : \beta = \beta_0$ versus
$H_1 : \beta \neq \beta_0$ is $W = (\hat\beta - \beta_0)/\widehat{\mathrm{se}}(\hat\beta)$.
13.12 Example (Election Data Revisited). On the log scale, our linear regression
gives the following prediction equation:

log(Buchanan) = -2.3298 + 0.7303 log(Bush).

In Palm Beach, Bush had 152,954 votes and Buchanan had 3,467 votes. On the
log scale this is 11.93789 and 8.151045. How likely is this outcome, assuming
our regression model is appropriate? Our prediction for log Buchanan votes is
-2.3298 + .7303 (11.93789) = 6.388441. Now, 8.151045 is bigger than 6.388441
but is it “significantly” bigger? Let us compute a confidence interval. We
find that $\hat\xi_n = .093775$ and the approximate 95 percent confidence interval is
(6.200, 6.578) which clearly excludes 8.151. Indeed, 8.151 is nearly 20 standard
errors from $\hat Y_*$. Going back to the vote scale by exponentiating, the confidence
interval is (493, 717) compared to the actual number of votes which is 3,467.
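
The numbers in this example can be retraced with the short Python sketch below. It takes the fitted coefficients and the value of the prediction standard error directly from the text rather than recomputing them from the election data, and it assumes the multiplier 2 (as in Example 13.10) for the 95 percent interval. The variable names are ours.

```python
import math

b0_hat, b1_hat = -2.3298, 0.7303   # fitted coefficients quoted in the example
x_star = 11.93789                  # log(Bush votes) in Palm Beach
y_obs = 8.151045                   # log(Buchanan votes) in Palm Beach
xi_hat = 0.093775                  # prediction standard error quoted in the example

y_pred = b0_hat + b1_hat * x_star                  # equation (13.15)
lower = y_pred - 2 * xi_hat                        # (13.17) with z taken as 2
upper = y_pred + 2 * xi_hat
n_se = (y_obs - y_pred) / xi_hat                   # distance in standard errors
votes_interval = (math.exp(lower), math.exp(upper))

print(y_pred, (lower, upper), n_se, votes_interval)
```

Small differences in the last digit relative to the printed interval are to be expected from rounding.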



13.5 Multiple Regression 

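Now suppose that the covariate is a vector of length k. The data are of the
form

$$
(Y_1, X_1), \ldots, (Y_i, X_i), \ldots, (Y_n, X_n)
$$

where

$$
X_i = (X_{i1}, \ldots, X_{ik}).
$$

Here, $X_i$ is the vector of k covariate values for the ith observation. The linear
regression model is

$$
Y_i = \sum_{j=1}^k \beta_j X_{ij} + \epsilon_i \qquad (13.18)
$$

for $i = 1, \ldots, n$, where $E(\epsilon_i \mid X_{i1}, \ldots, X_{ik}) = 0$. Usually we want to include an
intercept in the model which we can do by setting $X_{i1} = 1$ for $i = 1, \ldots, n$. At
this point it will be more convenient to express the model in matrix notation.
The outcomes will be denoted by

$$
Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}
$$

and the covariates will be denoted by

$$
X = \begin{pmatrix}
X_{11} & X_{12} & \cdots & X_{1k} \\
X_{21} & X_{22} & \cdots & X_{2k} \\
\vdots & \vdots & & \vdots \\
X_{n1} & X_{n2} & \cdots & X_{nk}
\end{pmatrix}.
$$

Each row is one observation; the columns correspond to the k covariates. Thus,
X is a (n × k) matrix. Let

$$
\beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}
\quad\text{and}\quad
\epsilon = \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}.
$$

Then we can write (13.18) as

$$
Y = X\beta + \epsilon. \qquad (13.19)
$$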

The form of the least squares estimate is given in the following theorem. 



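13.13 Theorem. Assuming that the (k × k) matrix $X^T X$ is invertible,

$$
\hat\beta = (X^T X)^{-1} X^T Y \qquad (13.20)
$$

$$
V(\hat\beta \mid X^n) = \sigma^2 (X^T X)^{-1} \qquad (13.21)
$$

$$
\hat\beta \approx N\bigl(\beta, \sigma^2 (X^T X)^{-1}\bigr). \qquad (13.22)
$$

The estimated regression function is $\hat r(x) = \sum_{j=1}^k \hat\beta_j x_j$. An unbiased
estimate of $\sigma^2$ is

$$
\hat\sigma^2 = \frac{1}{n - k} \sum_{i=1}^n \hat\epsilon_i^2
$$

where $\hat\epsilon = Y - X\hat\beta$ is the vector of residuals. An approximate $1 - \alpha$ confidence
interval for $\beta_j$ is

$$
\hat\beta_j \pm z_{\alpha/2}\, \widehat{\mathrm{se}}(\hat\beta_j) \qquad (13.23)
$$

where $\widehat{\mathrm{se}}^2(\hat\beta_j)$ is the jth diagonal element of the matrix $\hat\sigma^2 (X^T X)^{-1}$.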
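
The formulas of Theorem 13.13 translate almost line by line into matrix code. The following is a minimal sketch using NumPy on simulated data; the function name `ols`, the simulated design, and the use of 1.96 as the Normal quantile are our own choices for illustration.

```python
import numpy as np


def ols(X, Y):
    """Return (beta_hat, sigma2_hat, se) for the model Y = X beta + eps."""
    n, k = X.shape
    XtX = X.T @ X
    beta_hat = np.linalg.solve(XtX, X.T @ Y)     # equation (13.20)
    resid = Y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - k)         # unbiased estimate of sigma^2
    cov_hat = sigma2_hat * np.linalg.inv(XtX)    # plug-in version of (13.21)
    se = np.sqrt(np.diag(cov_hat))
    return beta_hat, sigma2_hat, se


# Simulated data: an intercept column plus two covariates.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat, sigma2_hat, se = ols(X, Y)
ci = np.column_stack([beta_hat - 1.96 * se, beta_hat + 1.96 * se])  # (13.23)
print(beta_hat, se, ci)
```

Solving the normal equations with `np.linalg.solve` avoids forming the inverse for the point estimate; the inverse is still needed for the standard errors.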



13.14 Example. Crime data on 47 states in 1960 can be obtained from
http://lib.stat.cmu.edu/DASL/Stories/USCrime.html.

If we fit a linear regression of crime rate on 10 variables we get the following:
Covariate                $\hat\beta_j$    $\widehat{\mathrm{se}}(\hat\beta_j)$    t value    p-value
(Intercept)              -589.39    167.59    -3.51    0.001 **
Age                         1.04      0.45     2.33    0.025 *
Southern State             11.29     13.24     0.85    0.399
Education                   1.18      0.68     1.7     0.093
Expenditures                0.96      0.25     3.86    0.000 ***
Labor                       0.11      0.15     0.69    0.493
Number of Males             0.30      0.22     1.36    0.181
Population                  0.09      0.14     0.65    0.518
Unemployment (14-24)       -0.68      0.48    -1.4     0.165
Unemployment (25-39)        2.15      0.95     2.26    0.030 *
Wealth                     -0.08      0.09    -0.91    0.367



This table is typical of the output of a multiple regression program. The
“t-value” is the Wald test statistic for testing $H_0 : \beta_j = 0$ versus $H_1 : \beta_j \neq 0$. The
asterisks denote “degree of significance” and more asterisks denote smaller
p-values. The example raises several important questions: (1) should we
eliminate some variables from this model? (2) should we interpret these
relationships as causal? For example, should we conclude that low crime prevention
expenditures cause high crime rates? We will address question (1) in the next
section. We will not address question (2) until Chapter 16. ■



13.6 Model Selection 

Example 13.14 illustrates a problem that often arises in multiple regression. 
We may have data on many covariates but we may not want to include all of 
them in the model. A smaller model with fewer covariates has two advantages: 
it might give better predictions than a big model and it is more parsimonious 
(simpler). Generally, as you add more variables to a regression, the bias of the 
predictions decreases and the variance increases. Too few covariates yields high
bias; this is called underfitting. Too many covariates yields high variance; this
is called overfitting. Good predictions result from achieving a good balance
between bias and variance. 

In model selection there are two problems: (i) assigning a “score” to each 
model which measures, in some sense, how good the model is, and (ii) searching
through all the models to find the model with the best score.

Let us first discuss the problem of scoring models. Let $S \subset \{1, \ldots, k\}$ and
let $\mathcal{X}_S = \{X_j : j \in S\}$ denote a subset of the covariates. Let $\beta_S$ denote the
coefficients of the corresponding set of covariates and let $\hat\beta_S$ denote the least
squares estimate of $\beta_S$. Also, let $X_S$ denote the X matrix for this subset of
covariates and define $\hat r_S(x)$ to be the estimated regression function. The
predicted values from model S are denoted by $\hat Y_i(S) = \hat r_S(X_i)$. The prediction
risk is defined to be

$$
R(S) = \sum_{i=1}^n E\bigl(\hat Y_i(S) - Y_i^*\bigr)^2 \qquad (13.24)
$$

where $Y_i^*$ denotes the value of a future observation of $Y_i$ at covariate value
$X_i$. Our goal is to choose S to make R(S) small.

The training error is defined to be

$$
\hat R_{tr}(S) = \sum_{i=1}^n \bigl(\hat Y_i(S) - Y_i\bigr)^2.
$$

This estimate is very biased as an estimate of R(S).

13.15 Theorem. The training error is a downward-biased estimate of the
prediction risk:

$$
E(\hat R_{tr}(S)) < R(S).
$$

In fact,

$$
\mathrm{bias}(\hat R_{tr}(S)) = E(\hat R_{tr}(S)) - R(S) = -2 \sum_{i=1}^n \mathrm{Cov}(\hat Y_i, Y_i). \qquad (13.25)
$$

The reason for the bias is that the data are being used twice: to estimate
the parameters and to estimate the risk. When we fit a complex model with
many parameters, the covariance $\mathrm{Cov}(\hat Y_i, Y_i)$ will be large and the bias of the
training error gets worse. Here are some better estimates of risk.

Mallows' $C_p$ statistic is defined by

$$
\hat R(S) = \hat R_{tr}(S) + 2 |S| \hat\sigma^2 \qquad (13.26)
$$

where |S| denotes the number of terms in S and $\hat\sigma^2$ is the estimate of $\sigma^2$
obtained from the full model (with all covariates in the model). This is simply
the training error plus a bias correction. This estimate is named in honor of
Colin Mallows who invented it. The first term in (13.26) measures the fit of
the model while the second measures the complexity of the model. Think of
the $C_p$ statistic as:

lack of fit + complexity penalty.

Thus, finding a good model involves trading off fit and complexity.
A related method for estimating risk is AIC (Akaike Information Criterion).
The idea is to choose S to maximize

$$
\ell_S - |S| \qquad (13.27)
$$

where $\ell_S$ is the log-likelihood of the model evaluated at the MLE.² This can
be thought of as “goodness of fit” minus “complexity.” In linear regression with
Normal errors (and taking $\sigma$ equal to its estimate from the largest model),
maximizing AIC is equivalent to minimizing Mallows' $C_p$; see Exercise 8. The
appendix contains more explanation about AIC.

Yet another method for estimating risk is leave-one-out cross-validation.
In this case, the risk estimator is

$$
\hat R_{CV}(S) = \sum_{i=1}^n \bigl(Y_i - \hat Y_{(i)}\bigr)^2 \qquad (13.28)
$$

where $\hat Y_{(i)}$ is the prediction for $Y_i$ obtained by fitting the model with $Y_i$ omitted.
It can be shown that

$$
\hat R_{CV}(S) = \sum_{i=1}^n \left( \frac{Y_i - \hat Y_i(S)}{1 - U_{ii}(S)} \right)^2 \qquad (13.29)
$$

where $U_{ii}(S)$ is the ith diagonal element of the matrix

$$
U(S) = X_S (X_S^T X_S)^{-1} X_S^T. \qquad (13.30)
$$

Thus, one need not actually drop each observation and re-fit the model. A
generalization is k-fold cross-validation. Here we divide the data into k
groups; often people take k = 10. We omit one group of data and fit the
models to the remaining data. We use the fitted model to predict the data
in the group that was omitted. We then estimate the risk by $\sum_i (Y_i - \hat Y_i)^2$
where the sum is over the data points in the omitted group. This process is
repeated for each of the k groups and the resulting risk estimates are averaged.
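
To see that the shortcut (13.29) really does agree with refitting the model n times, one can compare the two numerically. The sketch below uses NumPy on simulated data; the function names and the simulated design are ours and serve only as an illustration.

```python
import numpy as np


def loo_cv_shortcut(X, Y):
    """Leave-one-out risk from a single fit, using identity (13.29)."""
    U = X @ np.linalg.solve(X.T @ X, X.T)   # U = X (X^T X)^{-1} X^T, as in (13.30)
    fitted = U @ Y
    return float(np.sum(((Y - fitted) / (1.0 - np.diag(U))) ** 2))


def loo_cv_brute_force(X, Y):
    """Leave-one-out risk by refitting with each observation omitted, as in (13.28)."""
    n = len(Y)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        beta = np.linalg.lstsq(X[keep], Y[keep], rcond=None)[0]
        total += float((Y[i] - X[i] @ beta) ** 2)
    return total


rng = np.random.default_rng(1)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)

shortcut = loo_cv_shortcut(X, Y)
brute = loo_cv_brute_force(X, Y)
print(shortcut, brute)
```

The two values should agree up to floating point error, while the shortcut needs only one fit.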

For linear regression, Mallows' $C_p$ and cross-validation often yield essentially
the same results so one might as well use Mallows' method. In some of the
more complex problems we will discuss later, cross-validation will be more
useful.

Another scoring method is BIC (Bayesian information criterion). Here we
choose a model to maximize

$$
\mathrm{BIC}(S) = \ell_S - \frac{|S|}{2} \log n. \qquad (13.31)
$$



² Some texts use a slightly different definition of AIC which involves multiplying the definition
here by 2 or -2. This has no effect on which model is selected.
The BIC score has a Bayesian interpretation. Let $\mathcal{S} = \{S_1, \ldots, S_m\}$ denote
a set of models. Suppose we assign the prior $P(S_j) = 1/m$ over the models.
Also, assume we put a smooth prior on the parameters within each model. It
can be shown that the posterior probability for a model is approximately,

$$
P(S_j \mid \text{data}) \approx \frac{e^{\mathrm{BIC}(S_j)}}{\sum_r e^{\mathrm{BIC}(S_r)}}.
$$

Hence, choosing the model with highest BIC is like choosing the model with 
highest posterior probability. The BIC score also has an information-theoretic 
interpretation in terms of something called minimum description length. The 
BIC score is identical to Mallows Cp except that it puts a more severe penalty 
for complexity. It thus leads one to choose a smaller model than the other 
methods. 

Now let us turn to the problem of model search. If there are k covariates 
then there are $2^k$ possible models. We need to search through all these models,
assign a score to each one, and choose the model with the best score. If k is 
not too large we can do a complete search over all the models. When k is large, 
this is infeasible. In that case we need to search over a subset of all the models. 
Two common methods are forward and backward stepwise regression. 
In forward stepwise regression, we start with no covariates in the model. We 
then add the one variable that leads to the best score. We continue adding 
variables one at a time until the score does not improve. Backwards stepwise 
regression is the same except that we start with the biggest model and drop 
one variable at a time. Both are greedy searches; neither is guaranteed to
find the model with the best score. Another popular method is to do random
searching through the set of all models. However, there is no reason to expect 
this to be superior to a deterministic search. 
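
Forward stepwise regression is easy to sketch in code. The version below scores models with Mallows' $C_p$ from (13.26) and stops as soon as no single addition improves the score. It is an illustration under our own assumptions: the function names are ours, the data are simulated, and the empty model is taken to predict 0 (no intercept), so that its training error is simply the sum of squared outcomes.

```python
import numpy as np


def rss(X, Y, cols):
    """Residual sum of squares for the model using the listed columns of X."""
    if not cols:
        return float(Y @ Y)
    Xs = X[:, cols]
    beta = np.linalg.lstsq(Xs, Y, rcond=None)[0]
    r = Y - Xs @ beta
    return float(r @ r)


def forward_stepwise_cp(X, Y):
    """Greedy forward search that minimizes Mallows' Cp."""
    n, k = X.shape
    sigma2_full = rss(X, Y, list(range(k))) / (n - k)   # sigma^2 from the full model

    def cp(cols):
        return rss(X, Y, cols) + 2 * len(cols) * sigma2_full

    selected, best = [], cp([])
    while len(selected) < k:
        candidates = [j for j in range(k) if j not in selected]
        scores = {j: cp(selected + [j]) for j in candidates}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:      # no improvement, so stop
            break
        selected.append(j_best)
        best = scores[j_best]
    return sorted(selected), best


rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 6))
Y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=n)   # only columns 0 and 3 matter

selected, score = forward_stepwise_cp(X, Y)
print(selected, score)
```

Because the search is greedy, the selected set may contain a few noise covariates in addition to the two that truly matter, and it need not coincide with the best of all $2^k$ subsets.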

13.16 Example. We applied backwards stepwise regression to the crime data 
using AIC. The following was obtained from the program R. This program 
uses a slightly different definition of AIC. With their definition, we seek the 
smallest (not largest) possible AIC. This is the same as minimizing Mallows' $C_p$.

The full model (which includes all covariates) has AIC= 310.37. In ascend- 
ing order, the AIC scores for deleting one variable are as follows: 



variable   Pop   Labor   South   Wealth   Males   U1    Educ.   U2    Age   Expend
AIC        308   309     309     309      310     310   312     314   315   324



For example, if we dropped Pop from the model and kept the other terms,
then the AIC score would be 308. Based on this information we drop
“population” from the model and the current AIC score is 308. Now we consider
dropping a variable from the current model. The AIC scores are:



variable   South   Labor   Wealth   Males   U1    Education   U2    Age   Expend
AIC        308     308     308      309     309   310         313   313   329



We then drop “Southern” from the model. This process is continued until 
there is no gain in AIC by dropping any variables. In the end, we are left with 
the following model: 

Crime = 1.2 Age + .75 Education + .87 Expenditure 

+ .34 Males - .86 U1 + 2.31 U2. 

Warning! This does not yet address the question of which variables are 
causes of crime. ■ 

There is another method for model selection that avoids having to search 
through all possible models. This method, which is due to Zheng and Loh 
(1995), does not seek to minimize prediction errors. Rather, it assumes some 
subset of the $\beta_j$'s are exactly equal to 0 and tries to find the true model,
that is, the smallest sub-model consisting of nonzero $\beta_j$ terms. The method
is carried out as follows.

Zheng-Loh Model Selection Method³

1. Fit the full model with all k covariates and let $W_j = \hat\beta_j / \widehat{\mathrm{se}}(\hat\beta_j)$ denote
the Wald test statistic for $H_0 : \beta_j = 0$ versus $H_1 : \beta_j \neq 0$.

2. Order the test statistics from largest to smallest in absolute value:

$$
|W_{(1)}| \geq |W_{(2)}| \geq \cdots \geq |W_{(k)}|.
$$

3. Let $\hat j$ be the value of j that minimizes

$$
\mathrm{RSS}(j) + j\, \hat\sigma^2 \log n
$$

where RSS(j) is the residual sums of squares from the model with
the j largest Wald statistics.

4. Choose, as the final model, the regression with the $\hat j$ terms with the
largest absolute Wald statistics.

Zheng and Loh showed that, under appropriate conditions, this method
chooses the true model with probability tending to one as the sample size
increases.



13.7 Logistic Regression 



So far we have assumed that $Y_i$ is real valued. Logistic regression is a
parametric method for regression when $Y_i \in \{0, 1\}$ is binary. For a k-dimensional
covariate X, the model is

$$
p_i \equiv p_i(\beta) \equiv P(Y_i = 1 \mid X = x)
= \frac{e^{\beta_0 + \sum_{j=1}^k \beta_j x_{ij}}}{1 + e^{\beta_0 + \sum_{j=1}^k \beta_j x_{ij}}} \qquad (13.32)
$$

or, equivalently,

$$
\mathrm{logit}(p_i) = \beta_0 + \sum_{j=1}^k \beta_j x_{ij} \qquad (13.33)
$$

where

$$
\mathrm{logit}(p) = \log\left( \frac{p}{1 - p} \right). \qquad (13.34)
$$

The name “logistic regression” comes from the fact that $e^x/(1 + e^x)$ is called
the logistic function. A plot of the logistic for a one-dimensional covariate is
shown in Figure 13.3.

Because the $Y_i$'s are binary, the data are Bernoulli:

$$
Y_i \mid X_i = x_i \sim \mathrm{Bernoulli}(p_i).
$$

Hence the (conditional) likelihood function is

$$
\mathcal{L}(\beta) = \prod_{i=1}^n p_i(\beta)^{Y_i} \bigl(1 - p_i(\beta)\bigr)^{1 - Y_i}. \qquad (13.35)
$$



³ This is just one version of their method. In particular, the penalty j log n is only one choice
from a set of possible penalty functions.
The MLE $\hat\beta$ has to be obtained by maximizing $\mathcal{L}(\beta)$ numerically. There is
a fast numerical algorithm called reweighted least squares. The steps are as
follows:

Reweighted Least Squares Algorithm

Choose starting values $\hat\beta^0 = (\hat\beta_0^0, \ldots, \hat\beta_k^0)$ and compute $p_i^0$ using equation
(13.32), for $i = 1, \ldots, n$. Set s = 0 and iterate the following steps until
convergence.

1. Set

$$
Z_i = \mathrm{logit}(p_i^s) + \frac{Y_i - p_i^s}{p_i^s (1 - p_i^s)}, \qquad i = 1, \ldots, n.
$$

2. Let W be a diagonal matrix with (i, i) element equal to $p_i^s (1 - p_i^s)$.

3. Set

$$
\hat\beta^{s+1} = (X^T W X)^{-1} X^T W Z.
$$

This corresponds to doing a (weighted) linear regression of Z on X.

4. Set s = s + 1, recompute the $p_i^s$ from $\hat\beta^s$ using (13.32), and go back to the first step.
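
The reweighted least squares steps can be sketched in a few lines of NumPy. This is only an illustration under our own assumptions: the function name `logistic_irls`, the starting value of zero, the stopping rule, and the simulated data are ours, and the design matrix is assumed to already contain a column of ones for the intercept. The standard errors are taken as square roots of the diagonal of the inverse of $X^T W X$ evaluated at the final estimate.

```python
import numpy as np


def logistic_irls(X, Y, max_iter=50, tol=1e-10):
    """Maximize the logistic likelihood (13.35) by reweighted least squares."""
    beta = np.zeros(X.shape[1])                        # starting values
    for _ in range(max_iter):
        eta = X @ beta                                 # logit(p_i^s)
        p = 1.0 / (1.0 + np.exp(-eta))
        w = p * (1.0 - p)                              # diagonal of W
        z = eta + (Y - p) / w                          # working response Z
        XtW = X.T * w                                  # X^T W without forming W
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # weighted regression of Z on X
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    info = (X.T * (p * (1.0 - p))) @ X                 # information matrix X^T W X
    se = np.sqrt(np.diag(np.linalg.inv(info)))
    return beta, se


rng = np.random.default_rng(3)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.5])
Y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)

beta_hat, se = logistic_irls(X, Y)
print(beta_hat, se)
```

At convergence the score equations $X^T (Y - p) = 0$ hold, which gives a simple check that the iteration has reached the maximizer.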



The Fisher information matrix I can also be obtained numerically. The
estimated standard error of $\hat\beta_j$ is the square root of the (j, j) element of $J = I^{-1}$.
Model selection is usually done using the AIC score $\ell_S - |S|$.



13.17 Example. The Coronary Risk-Factor Study (CORIS) data involve 462 
males between the ages of 15 and 64 from three rural areas in South Africa
(Rousseauw et al. (1983)). The outcome Y is the presence (Y = 1) or absence
(Y = 0) of coronary heart disease. There are 9 covariates: systolic blood
pressure, cumulative tobacco (kg), ldl (low density lipoprotein cholesterol),
adiposity, famhist (family history of heart disease), typea (type-A behavior),
obesity, alcohol (current alcohol consumption), and age. A logistic regression 
yields the following estimates and Wald statistics Wj for the coefficients: 
Covariate     $\hat\beta_j$     se       $W_j$      p-value
Intercept     -6.145    1.300    -4.738    0.000
sbp            0.007    0.006     1.138    0.255
tobacco        0.079    0.027     2.991    0.003
ldl            0.174    0.059     2.925    0.003
adiposity      0.019    0.029     0.637    0.524
famhist        0.925    0.227     4.078    0.000
typea          0.040    0.012     3.233    0.001
obesity       -0.063    0.044    -1.427    0.153
alcohol        0.000    0.004     0.027    0.979
age            0.045    0.012     3.754    0.000



Are you surprised by the fact that systolic blood pressure is not significant
or by the minus sign for the obesity coefficient? If yes, then you are confusing
association and causation. This issue is discussed in Chapter 16. The fact
that blood pressure is not significant does not mean that blood pressure is
not an important cause of heart disease. It means that it is not an important
predictor of heart disease relative to the other variables in the model. ■



13.8 Bibliographic Remarks 

A succinct book on linear regression is Weisberg (1985). A data-mining view 
13.9 Appendix

The Akaike Information Criterion (AIC). Consider a set of models
$\{M_1, M_2, \ldots\}$. Let $\hat f_j(x)$ denote the estimated probability function obtained
by using the maximum likelihood estimator of model $M_j$. Thus, $\hat f_j(x) =
f(x; \hat\beta_j)$ where $\hat\beta_j$ is the MLE of the set of parameters $\beta_j$ for model $M_j$. We
will use the loss function $D(f, \hat f)$ where

$$
D(f, g) = \sum_x f(x) \log\left( \frac{f(x)}{g(x)} \right)
$$

is the Kullback-Leibler distance between two probability functions. The
corresponding risk function is $R(f, \hat f) = E(D(f, \hat f))$. Notice that $D(f, \hat f) = c -
A(f, \hat f)$ where $c = \sum_x f(x) \log f(x)$ does not depend on $\hat f$ and

$$
A(f, \hat f) = \sum_x f(x) \log \hat f(x).
$$

Thus, minimizing the risk is equivalent to maximizing $a(f, \hat f) \equiv E(A(f, \hat f))$.

It is tempting to estimate $a(f, \hat f)$ by $\sum_x \hat f(x) \log \hat f(x)$ but, just as the
training error in regression is a highly biased estimate of prediction risk, it is also
the case that $\sum_x \hat f(x) \log \hat f(x)$ is a highly biased estimate of $a(f, \hat f)$. In fact,
the bias is approximately equal to $|M_j|$. Thus:

13.18 Theorem. $\mathrm{AIC}(M_j)$ is an approximately unbiased estimate of $a(f, \hat f)$.



13.10 Exercises 

1. Prove Theorem 13.4. 

2. Prove the formulas for the standard errors in Theorem 13.8. You should
regard the $X_i$'s as fixed constants.

3. Consider the regression through the origin model:

$$
Y_i = \beta X_i + \epsilon_i.
$$

Find the least squares estimate for $\beta$. Find the standard error of the
estimate. Find conditions that guarantee that the estimate is consistent.

4. Prove equation (13.25).

5. In the simple linear regression model, construct a Wald test for $H_0 :
\beta_1 = 17\beta_0$ versus $H_1 : \beta_1 \neq 17\beta_0$.

6. Get the passenger car mileage data from 
http://lib.stat.cmu.edu/DASL/Datafiles/carmpgdat.html

(a) Fit a simple linear regression model to predict MPG (miles per
gallon) from HP (horsepower). Summarize your analysis including a
plot of the data with the fitted line.

(b) Repeat the analysis but use log(MPG) as the response. Compare 
the analyses. 
7. Get the passenger car mileage data from

http://lib.stat.cmu.edu/DASL/Datafiles/carmpgdat.html

(a) Fit a multiple linear regression model to predict MPG (miles per
gallon) from the other variables. Summarize your analysis.

(b) Use Mallows' $C_p$ to select a best sub-model. To search through the
models try (i) forward stepwise, (ii) backward stepwise. Summarize your
findings.

(c) Use the Zheng-Loh model selection method and compare to (b).

(d) Perform all possible regressions. Compare $C_p$ and BIC. Compare the
results.

8. Assume a linear regression model with Normal errors. Take $\sigma$ known.
Show that the model with highest AIC (equation (13.27)) is the model
with the lowest Mallows $C_p$ statistic.

9. In this question we will take a closer look at the AIC method. Let
$X_1, \ldots, X_n$ be IID observations. Consider two models $M_0$ and $M_1$. Under
$M_0$ the data are assumed to be N(0, 1) while under $M_1$ the data
are assumed to be $N(\theta, 1)$ for some unknown $\theta \in \mathbb{R}$:

$$
M_0 : X_1, \ldots, X_n \sim N(0, 1)
$$

$$
M_1 : X_1, \ldots, X_n \sim N(\theta, 1), \quad \theta \in \mathbb{R}.
$$

This is just another way to view the hypothesis testing problem: $H_0 :
\theta = 0$ versus $H_1 : \theta \neq 0$. Let $\ell_n(\theta)$ be the log-likelihood function.
The AIC score for a model is the log-likelihood at the MLE minus the
number of parameters. (Some people multiply this score by 2 but that
is irrelevant.) Thus, the AIC score for $M_0$ is $AIC_0 = \ell_n(0)$ and the AIC
score for $M_1$ is $AIC_1 = \ell_n(\hat\theta) - 1$. Suppose we choose the model with
the highest AIC score. Let $J_n$ denote the selected model:

$$
J_n = \begin{cases} 0 & \text{if } AIC_0 > AIC_1 \\ 1 & \text{if } AIC_1 > AIC_0. \end{cases}
$$

(a) Suppose that $M_0$ is the true model, i.e. $\theta = 0$. Find

$$
\lim_{n \to \infty} P(J_n = 0).
$$

Now compute $\lim_{n \to \infty} P(J_n = 0)$ when $\theta \neq 0$.

(b) The fact that $\lim_{n \to \infty} P(J_n = 0) \neq 1$ when $\theta = 0$ is why some people
say that AIC “overfits.” But this is not quite true as we shall now see.
Let $\phi_\theta(x)$ denote a Normal density function with mean $\theta$ and variance
1. Define

$$
\hat f_n(x) = \begin{cases} \phi_0(x) & \text{if } J_n = 0 \\ \phi_{\hat\theta}(x) & \text{if } J_n = 1. \end{cases}
$$

If $\theta = 0$, show that $D(\phi_0, \hat f_n) \xrightarrow{P} 0$ as $n \to \infty$ where

$$
D(f, g) = \int f(x) \log\left( \frac{f(x)}{g(x)} \right) dx
$$

is the Kullback-Leibler distance. Show also that $D(\phi_\theta, \hat f_n) \xrightarrow{P} 0$ if $\theta \neq 0$.
Hence, AIC consistently estimates the true density even if it
“overshoots” the correct model.

(c) Repeat this analysis for BIC which is the log-likelihood minus (p/2) log n
where p is the number of parameters and n is sample size.



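10. In this question we take a closer look at prediction intervals. Let $\theta =
\beta_0 + \beta_1 x_*$ and let $\hat\theta = \hat\beta_0 + \hat\beta_1 x_*$. Thus, $\hat Y_* = \hat\theta$ while $Y_* = \theta + \epsilon$. Now,
$\hat\theta \approx N(\theta, \mathrm{se}^2)$ where

$$
\mathrm{se}^2 = V(\hat\theta) = V(\hat\beta_0 + \hat\beta_1 x_*).
$$

Note that $V(\hat\theta)$ is the same as $V(\hat Y_*)$. Now, $\hat\theta \pm 2\sqrt{V(\hat\theta)}$ is an approximate
95 percent confidence interval for $\theta = \beta_0 + \beta_1 x_*$ using the usual argument
for a confidence interval. But, as you shall now show, it is not a valid
confidence interval for $Y_*$.

(a) Let $s = \sqrt{V(\hat Y_*)}$. Show that

$$
P(\hat Y_* - 2s < Y_* < \hat Y_* + 2s) \approx P\left( -2 < N\left(0, 1 + \frac{\sigma^2}{s^2}\right) < 2 \right) \neq 0.95.
$$

(b) The problem is that the quantity of interest $Y_*$ is equal to a
parameter $\theta$ plus a random variable. We can fix this by defining

$$
\xi_n^2 = V(\hat Y_*) + \sigma^2 = \left[ \frac{\sum_i (x_i - x_*)^2}{n \sum_i (x_i - \bar x)^2} + 1 \right] \sigma^2.
$$

In practice, we substitute $\hat\sigma$ for $\sigma$ and we denote the resulting quantity
by $\hat\xi_n$. Now consider the interval $\hat Y_* \pm 2\hat\xi_n$. Show that

$$
P(\hat Y_* - 2\hat\xi_n < Y_* < \hat Y_* + 2\hat\xi_n) \approx P(-2 < N(0, 1) < 2) \approx 0.95.
$$

11. Get the Coronary Risk-Factor Study (CORIS) data from the book web
site. Use backward stepwise logistic regression based on AIC to select a
model. Summarize your results.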




14 

Multivariate Models 



In this chapter we revisit the Multinomial model and the multivariate Normal.
Let us first review some notation from linear algebra. In what follows, x and
y are vectors and A is a matrix.

Linear Algebra Notation

$x^T y$       inner product $\sum_j x_j y_j$
$|A|$         determinant
$A^T$         transpose of A
$A^{-1}$      inverse of A
$I$           the identity matrix
tr(A)         trace of a square matrix; sum of its diagonal elements
$A^{1/2}$     square root matrix

The trace satisfies tr(AB) = tr(BA) and tr(A + B) = tr(A) + tr(B). Also, tr(a) = a if a
is a scalar. A matrix $\Sigma$ is positive definite if $x^T \Sigma x > 0$ for all nonzero vectors
x. If a matrix A is symmetric and positive definite, its square root $A^{1/2}$ exists
and has the following properties: (1) $A^{1/2}$ is symmetric; (2) $A = A^{1/2} A^{1/2}$;
(3) $A^{1/2} A^{-1/2} = A^{-1/2} A^{1/2} = I$ where $A^{-1/2} = (A^{1/2})^{-1}$.
14.1 Random Vectors

Multivariate models involve a random vector X of the form

$$
X = \begin{pmatrix} X_1 \\ \vdots \\ X_k \end{pmatrix}.
$$

The mean of a random vector X is defined by

$$
\mu = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_k \end{pmatrix}
= \begin{pmatrix} E(X_1) \\ \vdots \\ E(X_k) \end{pmatrix}. \qquad (14.1)
$$

The covariance matrix $\Sigma$, also written V(X), is defined to be

$$
\Sigma = \begin{pmatrix}
V(X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_k) \\
\mathrm{Cov}(X_2, X_1) & V(X_2) & \cdots & \mathrm{Cov}(X_2, X_k) \\
\vdots & \vdots & & \vdots \\
\mathrm{Cov}(X_k, X_1) & \mathrm{Cov}(X_k, X_2) & \cdots & V(X_k)
\end{pmatrix}. \qquad (14.2)
$$

This is also called the variance matrix or the variance-covariance matrix. The
inverse $\Sigma^{-1}$ is called the precision matrix.

14.1 Theorem. Let a be a vector of length k and let X be a random vector
of the same length with mean $\mu$ and variance $\Sigma$. Then $E(a^T X) = a^T \mu$ and
$V(a^T X) = a^T \Sigma a$. If A is a matrix with k columns, then $E(AX) = A\mu$ and
$V(AX) = A \Sigma A^T$.


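Now suppose we have a random sample of n vectors:

$$
\begin{pmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{k1} \end{pmatrix},
\begin{pmatrix} X_{12} \\ X_{22} \\ \vdots \\ X_{k2} \end{pmatrix},
\ldots,
\begin{pmatrix} X_{1n} \\ X_{2n} \\ \vdots \\ X_{kn} \end{pmatrix}. \qquad (14.3)
$$

The sample mean $\bar X$ is a vector defined by

$$
\bar X = \begin{pmatrix} \bar X_1 \\ \vdots \\ \bar X_k \end{pmatrix}
$$

where $\bar X_i = n^{-1} \sum_{j=1}^n X_{ij}$. The sample variance matrix, also called the
covariance matrix or the variance-covariance matrix, is

$$
S = \begin{pmatrix}
s_{11} & s_{12} & \cdots & s_{1k} \\
s_{12} & s_{22} & \cdots & s_{2k} \\
\vdots & \vdots & & \vdots \\
s_{1k} & s_{2k} & \cdots & s_{kk}
\end{pmatrix} \qquad (14.4)
$$

where

$$
s_{ab} = \frac{1}{n - 1} \sum_{j=1}^n (X_{aj} - \bar X_a)(X_{bj} - \bar X_b).
$$

It follows that $E(\bar X) = \mu$ and $E(S) = \Sigma$.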



14.2 Estimating the Correlation 



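Consider n data points from a bivariate distribution:

$$
\begin{pmatrix} X_{11} \\ X_{21} \end{pmatrix},
\begin{pmatrix} X_{12} \\ X_{22} \end{pmatrix},
\ldots,
\begin{pmatrix} X_{1n} \\ X_{2n} \end{pmatrix}.
$$

Recall that the correlation between $X_1$ and $X_2$ is

$$
\rho = \frac{E\bigl((X_1 - \mu_1)(X_2 - \mu_2)\bigr)}{\sigma_1 \sigma_2}
$$

where $\sigma_j^2 = V(X_j)$, j = 1, 2. The nonparametric plug-in estimator is the
sample correlation¹

$$
\hat\rho = \frac{\sum_{i=1}^n (X_{1i} - \bar X_1)(X_{2i} - \bar X_2)}{(n - 1)\, s_1 s_2} \qquad (14.6)
$$

where

$$
s_j^2 = \frac{1}{n - 1} \sum_{i=1}^n (X_{ji} - \bar X_j)^2.
$$

We can construct a confidence interval for $\rho$ by applying the delta method.
However, it turns out that we get a more accurate confidence interval by first
constructing a confidence interval for a function $\theta = f(\rho)$ and then applying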


¹ More precisely, the plug-in estimator has n rather than n - 1 in the formula for $s_j$ but this
difference is small.
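the inverse function $f^{-1}$. The method, due to Fisher, is as follows: Define f
and its inverse by

$$
f(r) = \frac{1}{2} \bigl( \log(1 + r) - \log(1 - r) \bigr)
$$

$$
f^{-1}(z) = \frac{e^{2z} - 1}{e^{2z} + 1}.
$$

Approximate Confidence Interval for the Correlation

1. Compute

$$
\hat\theta = f(\hat\rho) = \frac{1}{2} \bigl( \log(1 + \hat\rho) - \log(1 - \hat\rho) \bigr).
$$

2. Compute the approximate standard error of $\hat\theta$ which can be shown to
be

$$
\widehat{\mathrm{se}}(\hat\theta) = \frac{1}{\sqrt{n - 3}}.
$$

3. An approximate $1 - \alpha$ confidence interval for $\theta = f(\rho)$ is

$$
(a, b) \equiv \left( \hat\theta - \frac{z_{\alpha/2}}{\sqrt{n - 3}},\ \hat\theta + \frac{z_{\alpha/2}}{\sqrt{n - 3}} \right).
$$

4. Apply the inverse transformation $f^{-1}(z)$ to get a confidence interval
for $\rho$:

$$
\left( \frac{e^{2a} - 1}{e^{2a} + 1},\ \frac{e^{2b} - 1}{e^{2b} + 1} \right).
$$

Yet another method for getting a confidence interval for $\rho$ is to use the
bootstrap.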
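
The four steps of Fisher's method fit in a short function. The following Python sketch uses only the standard library; the function names and the toy data are ours and are meant purely as an illustration.

```python
import math
from statistics import NormalDist


def sample_correlation(x1, x2):
    """Sample correlation of two equal-length lists of numbers."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    return s12 / math.sqrt(s11 * s22)


def fisher_ci(rho_hat, n, alpha=0.05):
    """Approximate 1 - alpha confidence interval for the correlation."""
    theta_hat = 0.5 * (math.log(1 + rho_hat) - math.log(1 - rho_hat))  # step 1
    se = 1.0 / math.sqrt(n - 3)                                        # step 2
    z = NormalDist().inv_cdf(1 - alpha / 2)
    a, b = theta_hat - z * se, theta_hat + z * se                      # step 3

    def f_inverse(t):
        return (math.exp(2 * t) - 1) / (math.exp(2 * t) + 1)

    return f_inverse(a), f_inverse(b)                                  # step 4


# Toy data, invented for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.3, 1.1, 3.9, 3.0, 6.2, 4.8, 6.1, 8.4]

rho_hat = sample_correlation(x1, x2)
lower, upper = fisher_ci(rho_hat, len(x1))
print(rho_hat, (lower, upper))
```

Because the interval is built on the transformed scale and then mapped back, it always stays inside (-1, 1) and is not symmetric about the sample correlation.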

14.3 Multivariate Normal 

Recall that a vector X has a multivariate Normal distribution, denoted by
$X \sim N(\mu, \Sigma)$, if its density is

$$
f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\} \qquad (14.7)
$$

where $\mu$ is a vector of length k and $\Sigma$ is a k × k symmetric, positive definite
matrix. Then $E(X) = \mu$ and $V(X) = \Sigma$.






14.2 Theorem. The following properties hold:

1. If $Z \sim N(0, I)$ and $X = \mu + \Sigma^{1/2} Z$, then $X \sim N(\mu, \Sigma)$.

2. If $X \sim N(\mu, \Sigma)$, then $\Sigma^{-1/2}(X - \mu) \sim N(0, I)$.

3. If $X \sim N(\mu, \Sigma)$ and a is a vector of the same length as X, then $a^T X \sim
N(a^T \mu, a^T \Sigma a)$.

4. Let

$$
V = (X - \mu)^T \Sigma^{-1} (X - \mu).
$$

Then $V \sim \chi^2_k$.
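
Property 1 gives a direct recipe for simulating a multivariate Normal, and property 4 gives a way to sanity-check the draws. The NumPy sketch below is one possible illustration; the function names, the particular mean and covariance, and the use of an eigendecomposition to form the symmetric square root are our own choices.

```python
import numpy as np


def sqrt_psd(Sigma):
    """Symmetric square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(Sigma)
    return (vecs * np.sqrt(vals)) @ vecs.T


def sample_mvn(mu, Sigma, nsim, rng):
    """Draw nsim vectors X = mu + Sigma^{1/2} Z with Z ~ N(0, I), one per row."""
    Z = rng.standard_normal((nsim, len(mu)))
    return mu + Z @ sqrt_psd(Sigma)   # Sigma^{1/2} is symmetric, so no transpose is needed


mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
rng = np.random.default_rng(4)
X = sample_mvn(mu, Sigma, 20000, rng)

# Property 4: V = (X - mu)^T Sigma^{-1} (X - mu) should behave like chi-squared with k = 2.
D = X - mu
V = np.einsum("ij,jk,ik->i", D, np.linalg.inv(Sigma), D)
print(X.mean(axis=0), np.cov(X.T), V.mean())
```

A chi-squared variable with k degrees of freedom has mean k, so the average of V should be close to 2 here.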

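14.3 Theorem. Given a random sample of size n from a $N(\mu, \Sigma)$, the
log-likelihood is (up to a constant not depending on $\mu$ or $\Sigma$) given by

$$
\ell(\mu, \Sigma) = -\frac{n}{2} (\bar X - \mu)^T \Sigma^{-1} (\bar X - \mu) - \frac{n - 1}{2} \mathrm{tr}(\Sigma^{-1} S) - \frac{n}{2} \log |\Sigma|.
$$

The MLE is

$$
\hat\mu = \bar X \quad\text{and}\quad \hat\Sigma = \left( \frac{n - 1}{n} \right) S. \qquad (14.8)
$$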



14.4 Multinomial 



Let us now review the Multinomial distribution. The data take the form
$X = (X_1, \ldots, X_k)$ where each $X_j$ is a count. Think of drawing n balls (with
replacement) from an urn which has balls with k different colors. In this case,
$X_j$ is the number of balls of the jth color. Let $p = (p_1, \ldots, p_k)$ where $p_j \geq 0$
and $\sum_{j=1}^k p_j = 1$ and suppose that $p_j$ is the probability of drawing a ball of
color j.



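14.4 Theorem. Let $X \sim \mathrm{Multinomial}(n, p)$. Then the marginal distribution
of $X_j$ is $X_j \sim \mathrm{Binomial}(n, p_j)$. The mean and variance of X are

$$
E(X) = \begin{pmatrix} n p_1 \\ \vdots \\ n p_k \end{pmatrix}
$$

and

$$
V(X) = \begin{pmatrix}
n p_1 (1 - p_1) & -n p_1 p_2 & \cdots & -n p_1 p_k \\
-n p_1 p_2 & n p_2 (1 - p_2) & \cdots & -n p_2 p_k \\
\vdots & \vdots & & \vdots \\
-n p_1 p_k & -n p_2 p_k & \cdots & n p_k (1 - p_k)
\end{pmatrix}.
$$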
Proof. That $X_j \sim \mathrm{Binomial}(n, p_j)$ follows easily. Hence, $E(X_j) = n p_j$ and
$V(X_j) = n p_j (1 - p_j)$. To compute $\mathrm{Cov}(X_i, X_j)$ we proceed as follows: Notice
that $X_i + X_j \sim \mathrm{Binomial}(n, p_i + p_j)$ and so $V(X_i + X_j) = n (p_i + p_j)(1 - p_i - p_j)$.
On the other hand,

$$
V(X_i + X_j) = V(X_i) + V(X_j) + 2\, \mathrm{Cov}(X_i, X_j)
= n p_i (1 - p_i) + n p_j (1 - p_j) + 2\, \mathrm{Cov}(X_i, X_j).
$$

Equating this last expression with $n (p_i + p_j)(1 - p_i - p_j)$ implies that
$\mathrm{Cov}(X_i, X_j) = -n p_i p_j$. ■

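14.5 Theorem. The maximum likelihood estimator of p is

$$
\hat p = \begin{pmatrix} \hat p_1 \\ \vdots \\ \hat p_k \end{pmatrix}
= \begin{pmatrix} X_1 / n \\ \vdots \\ X_k / n \end{pmatrix} = \frac{X}{n}.
$$

Proof. The log-likelihood (ignoring a constant) is

$$
\ell(p) = \sum_{j=1}^k X_j \log p_j.
$$

When we maximize $\ell$ we have to be careful since we must enforce the
constraint that $\sum_j p_j = 1$. We use the method of Lagrange multipliers and instead
maximize

$$
A(p) = \sum_{j=1}^k X_j \log p_j + \lambda \Bigl( \sum_{j=1}^k p_j - 1 \Bigr).
$$

Now

$$
\frac{\partial A(p)}{\partial p_j} = \frac{X_j}{p_j} + \lambda.
$$

Setting $\partial A(p) / \partial p_j = 0$ yields $\hat p_j = -X_j / \lambda$. Since $\sum_j \hat p_j = 1$ we see that $\lambda = -n$
and hence $\hat p_j = X_j / n$ as claimed. ■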

Next we would like to know the variability of the MLE. We can either
compute the variance matrix of $\hat p$ directly or we can approximate the
variability of the MLE by computing the Fisher information matrix. These two
approaches give the same answer in this case. The direct approach is easy:
$V(\hat p) = V(X/n) = n^{-2} V(X)$, and so

$$
V(\hat p) = \frac{1}{n} \Sigma
$$

where

$$
\Sigma = \begin{pmatrix}
p_1 (1 - p_1) & -p_1 p_2 & \cdots & -p_1 p_k \\
-p_1 p_2 & p_2 (1 - p_2) & \cdots & -p_2 p_k \\
\vdots & \vdots & & \vdots \\
-p_1 p_k & -p_2 p_k & \cdots & p_k (1 - p_k)
\end{pmatrix}.
$$

For large n, $\hat p$ has approximately a multivariate Normal distribution.

14.6 Theorem. As $n \to \infty$,

$$
\sqrt{n} (\hat p - p) \rightsquigarrow N(0, \Sigma).
$$
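
As a small illustration of Theorem 14.5 and the variance formula above, the following Python sketch computes the MLE and the plug-in estimate of $V(\hat p) = \Sigma / n$ from a vector of counts. The function name `multinomial_mle` and the counts are made up for this example.

```python
def multinomial_mle(counts):
    """Return (p_hat, cov_hat), where cov_hat is the plug-in estimate of Sigma / n."""
    n = sum(counts)
    k = len(counts)
    p_hat = [x / n for x in counts]
    cov_hat = [
        [
            (p_hat[i] * (1 - p_hat[i]) if i == j else -p_hat[i] * p_hat[j]) / n
            for j in range(k)
        ]
        for i in range(k)
    ]
    return p_hat, cov_hat


counts = [12, 30, 58]   # invented counts for k = 3 colors, n = 100 draws
p_hat, cov_hat = multinomial_mle(counts)
se = [cov_hat[j][j] ** 0.5 for j in range(len(counts))]
print(p_hat, se)
```

Each row of the estimated variance matrix sums to zero, which reflects the constraint that the components of $\hat p$ add up to one.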



14.5 Bibliographic Remarks 

Some references on multivariate analysis are Johnson and Wichern (1982) and 
Anderson (1984). The method for constructing the confidence interval for the
correlation described in this chapter is due to Fisher (1921).



14.6 Appendix 

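Proof of Theorem 14.3. Denote the ith random vector by $X^i$. The
log-likelihood is

$$
\ell(\mu, \Sigma) = \sum_{i=1}^n \log f(X^i; \mu, \Sigma)
= -\frac{kn}{2} \log(2\pi) - \frac{n}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^n (X^i - \mu)^T \Sigma^{-1} (X^i - \mu).
$$

Now,

$$
\sum_{i=1}^n (X^i - \mu)^T \Sigma^{-1} (X^i - \mu)
= \sum_{i=1}^n \bigl[(X^i - \bar X) + (\bar X - \mu)\bigr]^T \Sigma^{-1} \bigl[(X^i - \bar X) + (\bar X - \mu)\bigr]
$$

$$
= \sum_{i=1}^n (X^i - \bar X)^T \Sigma^{-1} (X^i - \bar X) + n (\bar X - \mu)^T \Sigma^{-1} (\bar X - \mu)
$$

since $\sum_{i=1}^n (X^i - \bar X)^T \Sigma^{-1} (\bar X - \mu) = 0$. Also, notice that $(X^i - \bar X)^T \Sigma^{-1} (X^i - \bar X)$
is a scalar, so

$$
\sum_{i=1}^n (X^i - \bar X)^T \Sigma^{-1} (X^i - \bar X)
= \sum_{i=1}^n \mathrm{tr}\bigl[ (X^i - \bar X)^T \Sigma^{-1} (X^i - \bar X) \bigr]
= \sum_{i=1}^n \mathrm{tr}\bigl[ \Sigma^{-1} (X^i - \bar X)(X^i - \bar X)^T \bigr]
$$

$$
= \mathrm{tr}\Bigl[ \Sigma^{-1} \sum_{i=1}^n (X^i - \bar X)(X^i - \bar X)^T \Bigr]
= (n - 1)\, \mathrm{tr}\bigl[ \Sigma^{-1} S \bigr]
$$

and the conclusion follows. ■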



14.7 Exercises 



1. Prove Theorem 14.1. 

2. Find the Fisher information matrix for the mle of a Multinomial. 

3. (Computer Experiment.) Write a function to generate nsim observations
from a Multinomial(n, p) distribution.

4. (Computer Experiment.) Write a function to generate nsim observations
from a Multivariate normal with given mean $\mu$ and covariance matrix
$\Sigma$.

5. (Computer Experiment.) Generate 100 random vectors from a $N(\mu, \Sigma)$
distribution where

$$
\mu = \begin{pmatrix} 3 \\ 8 \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}.
$$

Plot the simulation as a scatterplot. Estimate the mean and covariance
matrix $\Sigma$. Find the correlation $\rho$ between $X_1$ and $X_2$. Compare this
with the sample correlations from your simulation. Find a 95 percent
confidence interval for $\rho$. Use two methods: the bootstrap and Fisher’s
method. Compare.

6. (Computer Experiment.) Repeat the previous exercise 1000 times. Com- 
pare the coverage of the two confidence intervals for p. 




15 

Inference About Independence 



In this chapter we address the following questions: 

(1) How do we test if two random variables are independent? 

(2) How do we estimate the strength of dependence between two 
random variables? 

When $Y$ and $Z$ are not independent, we say that they are dependent or
associated or related. If $Y$ and $Z$ are associated, it does not imply that $Y$
causes $Z$ or that $Z$ causes $Y$. Causation is discussed in Chapter 16.

Recall that we write $Y \perp\!\!\!\perp Z$ to mean that $Y$ and $Z$ are independent and we
write $Y \not\perp\!\!\!\perp Z$ to mean that $Y$ and $Z$ are dependent.



15.1 Two Binary Variables 

Suppose that $Y$ and $Z$ are both binary and consider data $(Y_1, Z_1), \ldots, (Y_n, Z_n)$.
We can represent the data as a two-by-two table:

               Y = 0            Y = 1
    Z = 0      $X_{00}$         $X_{01}$         $X_{0\cdot}$
    Z = 1      $X_{10}$         $X_{11}$         $X_{1\cdot}$
               $X_{\cdot 0}$    $X_{\cdot 1}$    $n = X_{\cdot\cdot}$

where

$$X_{ij} = \text{number of observations for which } Z = i \text{ and } Y = j.$$

The dotted subscripts denote sums. Thus,

$$X_{i\cdot} = \sum_j X_{ij}, \qquad X_{\cdot j} = \sum_i X_{ij}, \qquad n = X_{\cdot\cdot} = \sum_{i,j} X_{ij}.$$

This is a convention we use throughout the remainder of the book. Denote
the corresponding probabilities by:

               Y = 0            Y = 1
    Z = 0      $p_{00}$         $p_{01}$         $p_{0\cdot}$
    Z = 1      $p_{10}$         $p_{11}$         $p_{1\cdot}$
               $p_{\cdot 0}$    $p_{\cdot 1}$    1

where $p_{ij} = \mathbb{P}(Z = i, Y = j)$. Let $X = (X_{00}, X_{01}, X_{10}, X_{11})$ denote the vector
of counts. Then $X \sim \text{Multinomial}(n, p)$ where $p = (p_{00}, p_{01}, p_{10}, p_{11})$. It is now
convenient to introduce two new parameters.



15.1 Definition. The odds ratio is defined to be

$$\psi = \frac{p_{00}\,p_{11}}{p_{01}\,p_{10}}. \qquad (15.1)$$

The log odds ratio is defined to be

$$\gamma = \log(\psi). \qquad (15.2)$$



15.2 Theorem. The following statements are equivalent:

1. $Y \perp\!\!\!\perp Z$.

2. $\psi = 1$.

3. $\gamma = 0$.

4. For $i, j \in \{0, 1\}$, $p_{ij} = p_{i\cdot}\,p_{\cdot j}$.

Now consider testing

$$H_0 : Y \perp\!\!\!\perp Z \quad \text{versus} \quad H_1 : Y \not\perp\!\!\!\perp Z. \qquad (15.3)$$

First we consider the likelihood ratio test. Under $H_1$, $X \sim \text{Multinomial}(n, p)$
and the MLE is the vector $\hat{p} = X/n$. Under $H_0$, we again have that $X \sim
\text{Multinomial}(n, p)$ but the restricted MLE is computed under the constraint
$p_{ij} = p_{i\cdot}\,p_{\cdot j}$. This leads to the following test:

15.3 Theorem. The likelihood ratio test statistic for (15.3) is

$$T = 2 \sum_{i=0}^{1} \sum_{j=0}^{1} X_{ij} \log\left(\frac{X_{ij}\,X_{\cdot\cdot}}{X_{i\cdot}\,X_{\cdot j}}\right). \qquad (15.4)$$

Under $H_0$, $T \rightsquigarrow \chi^2_1$. Thus, an approximate level $\alpha$ test is obtained by
rejecting $H_0$ when $T > \chi^2_{1,\alpha}$.

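Another popular test for independence is Pearson’s $\chi^2$ test.

15.4 Theorem. Pearson’s $\chi^2$ test statistic for independence is

$$U = \sum_{i=0}^{1} \sum_{j=0}^{1} \frac{(X_{ij} - E_{ij})^2}{E_{ij}} \qquad (15.5)$$

where

$$E_{ij} = \frac{X_{i\cdot}\,X_{\cdot j}}{n}.$$

Under $H_0$, $U \rightsquigarrow \chi^2_1$. Thus, an approximate level $\alpha$ test is obtained by
rejecting $H_0$ when $U > \chi^2_{1,\alpha}$.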



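Here is the intuition for the Pearson test. Under $H_0$, $p_{ij} = p_{i\cdot}\,p_{\cdot j}$, so the
maximum likelihood estimator of $p_{ij}$ under $H_0$ is

$$\hat{p}_{ij} = \hat{p}_{i\cdot}\,\hat{p}_{\cdot j} = \frac{X_{i\cdot}}{n}\,\frac{X_{\cdot j}}{n}.$$

Thus, the expected number of observations in the $(i,j)$ cell is

$$E_{ij} = n\,\hat{p}_{ij} = \frac{X_{i\cdot}\,X_{\cdot j}}{n}.$$

The statistic $U$ compares the observed and expected counts.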



15.5 Example. The following data from Johnson and Johnson (1972) relate
tonsillectomy and Hodgkins disease.¹

                        Hodgkins Disease    No Disease    Total
    Tonsillectomy              90               165         255
    No Tonsillectomy           84               307         391
    Total                     174               472         646

¹The data are actually from a case-control study; see the appendix for an explanation of
case-control studies.

We would like to know if tonsillectomy is related to Hodgkins disease. The
likelihood ratio statistic is $T = 14.75$ and the p-value is $\mathbb{P}(\chi^2_1 > 14.75) = .0001$.
The $\chi^2$ statistic is $U = 14.96$ and the p-value is $\mathbb{P}(\chi^2_1 > 14.96) = .0001$. We
reject the null hypothesis of independence and conclude that tonsillectomy is
associated with Hodgkins disease. This does not mean that tonsillectomies cause
Hodgkins disease. Suppose, for example, that doctors gave tonsillectomies to
the most seriously ill patients. Then the association between tonsillectomies
and Hodgkins disease may be due to the fact that those with tonsillectomies
were the most ill patients and hence more likely to have a serious disease. ■
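The two statistics in this example can be reproduced directly from formulas (15.4) and (15.5). The following is a minimal sketch using only the Python standard library; the function names are illustrative, and the p-value uses the fact that a $\chi^2_1$ variable is the square of a standard Normal, so that $\mathbb{P}(\chi^2_1 > t) = \mathrm{erfc}(\sqrt{t/2})$.

```python
import math


def independence_tests_2x2(table):
    """Return (T, U) for a 2x2 table of counts [[x00, x01], [x10, x11]].

    T is the likelihood ratio statistic (15.4) and U is Pearson's
    statistic (15.5). Both compare X_ij with E_ij = X_i. X_.j / n.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    T = 0.0
    U = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            expected = row_totals[i] * col_totals[j] / n
            if observed > 0:  # an empty cell contributes 0 to T
                T += 2 * observed * math.log(observed / expected)
            U += (observed - expected) ** 2 / expected
    return T, U


def chi2_1_pvalue(t):
    """P(chi^2_1 > t), via the standard Normal tail."""
    return math.erfc(math.sqrt(t / 2))


# Tonsillectomy data from Example 15.5.
tonsil = [[90, 165], [84, 307]]
T, U = independence_tests_2x2(tonsil)
p_T, p_U = chi2_1_pvalue(T), chi2_1_pvalue(U)
```

Both statistics are symmetric in the roles of rows and columns, so it does not matter which variable indexes the rows.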



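We can also estimate the strength of dependence by estimating the odds
ratio $\psi$ and the log-odds ratio $\gamma$.

15.6 Theorem. The MLEs of $\psi$ and $\gamma$ are

$$\hat{\psi} = \frac{X_{00}\,X_{11}}{X_{01}\,X_{10}}, \qquad \hat{\gamma} = \log\hat{\psi}. \qquad (15.6)$$

The asymptotic standard errors (computed using the delta method) are

$$\mathrm{se}(\hat{\gamma}) = \sqrt{\frac{1}{X_{00}} + \frac{1}{X_{01}} + \frac{1}{X_{10}} + \frac{1}{X_{11}}} \qquad (15.7)$$

$$\mathrm{se}(\hat{\psi}) = \hat{\psi}\;\mathrm{se}(\hat{\gamma}). \qquad (15.8)$$

15.7 Remark. For small sample sizes, $\hat{\psi}$ and $\hat{\gamma}$ can have a very large variance.
In this case, we often use the modified estimator

$$\hat{\psi} = \frac{\left(X_{00} + \frac{1}{2}\right)\left(X_{11} + \frac{1}{2}\right)}{\left(X_{01} + \frac{1}{2}\right)\left(X_{10} + \frac{1}{2}\right)}. \qquad (15.9)$$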



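Another test for independence is the Wald test for $\gamma = 0$ given by $W =
(\hat{\gamma} - 0)/\mathrm{se}(\hat{\gamma})$. A $1 - \alpha$ confidence interval for $\gamma$ is $\hat{\gamma} \pm z_{\alpha/2}\,\mathrm{se}(\hat{\gamma})$.

A $1 - \alpha$ confidence interval for $\psi$ can be obtained in two ways. First, we
could use $\hat{\psi} \pm z_{\alpha/2}\,\mathrm{se}(\hat{\psi})$. Second, since $\psi = e^{\gamma}$ we could use

$$\exp\left\{\hat{\gamma} \pm z_{\alpha/2}\,\mathrm{se}(\hat{\gamma})\right\}. \qquad (15.10)$$

This second method is usually more accurate.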



15.8 Example. In the previous example,

$$\hat{\psi} = \frac{90 \times 307}{165 \times 84} = 1.99$$

and

$$\hat{\gamma} = \log(1.99) = .69.$$

So the odds of Hodgkins disease were about twice as high among tonsillectomy
patients. The standard error of $\hat{\gamma}$ is

$$\sqrt{\frac{1}{90} + \frac{1}{84} + \frac{1}{165} + \frac{1}{307}} = .18.$$

The Wald statistic is $W = .69/.18 = 3.84$ whose p-value is $\mathbb{P}(|Z| > 3.84) =
.0001$, the same as the other tests. A 95 percent confidence interval for $\gamma$ is
$\hat{\gamma} \pm 2(.18) = (.33, 1.05)$. A 95 percent confidence interval for $\psi$ is $(e^{.33}, e^{1.05}) =
(1.39, 2.86)$. ■
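The calculations in this example follow mechanically from Theorem 15.6 and equation (15.10). Here is a minimal sketch; the function name and the dictionary keys are illustrative choices, and the default multiplier 1.96 is an assumption (the example above rounds it to 2, so the interval endpoints differ slightly).

```python
import math


def odds_ratio_inference(x00, x01, x10, x11, z=1.96):
    """Estimate the odds ratio and log odds ratio from a 2x2 table.

    Returns the MLEs, the delta-method standard error of gamma_hat,
    the Wald statistic for gamma = 0 with its two-sided p-value, and
    confidence intervals for gamma and psi (the latter via 15.10).
    """
    psi_hat = (x00 * x11) / (x01 * x10)
    gamma_hat = math.log(psi_hat)
    se_gamma = math.sqrt(1 / x00 + 1 / x01 + 1 / x10 + 1 / x11)
    wald = gamma_hat / se_gamma
    # P(|Z| > |W|) for a standard Normal Z.
    p_value = math.erfc(abs(wald) / math.sqrt(2))
    ci_gamma = (gamma_hat - z * se_gamma, gamma_hat + z * se_gamma)
    ci_psi = (math.exp(ci_gamma[0]), math.exp(ci_gamma[1]))
    return {
        "psi_hat": psi_hat,
        "gamma_hat": gamma_hat,
        "se_gamma": se_gamma,
        "wald": wald,
        "p_value": p_value,
        "ci_gamma": ci_gamma,
        "ci_psi": ci_psi,
    }


# Tonsillectomy data from Example 15.5.
result = odds_ratio_inference(90, 165, 84, 307)
```

Exponentiating the interval for $\gamma$ keeps the interval for $\psi$ inside the positive half-line, which the symmetric interval $\hat{\psi} \pm z_{\alpha/2}\,\mathrm{se}(\hat{\psi})$ does not guarantee.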

15.2 Two Discrete Variables 

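Now suppose that $Z \in \{1, \ldots, I\}$ and $Y \in \{1, \ldots, J\}$ are two discrete
variables. The data can be represented as an $I \times J$ table of counts:

               Y = 1          Y = 2          ···    Y = j          ···    Y = J
    Z = 1      $X_{11}$       $X_{12}$       ···    $X_{1j}$       ···    $X_{1J}$       $X_{1\cdot}$
      ⋮
    Z = i      $X_{i1}$       $X_{i2}$       ···    $X_{ij}$       ···    $X_{iJ}$       $X_{i\cdot}$
      ⋮
    Z = I      $X_{I1}$       $X_{I2}$       ···    $X_{Ij}$       ···    $X_{IJ}$       $X_{I\cdot}$
               $X_{\cdot 1}$  $X_{\cdot 2}$  ···    $X_{\cdot j}$  ···    $X_{\cdot J}$  $n$

where

$$X_{ij} = \text{number of observations for which } Z = i \text{ and } Y = j.$$

Consider testing

$$H_0 : Y \perp\!\!\!\perp Z \quad \text{versus} \quad H_1 : Y \not\perp\!\!\!\perp Z. \qquad (15.11)$$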



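15.9 Theorem. The likelihood ratio test statistic for (15.11) is

$$T = 2 \sum_{i=1}^{I} \sum_{j=1}^{J} X_{ij} \log\left(\frac{X_{ij}\,X_{\cdot\cdot}}{X_{i\cdot}\,X_{\cdot j}}\right). \qquad (15.12)$$

The limiting distribution of $T$ under the null hypothesis of independence
is $\chi^2_{\nu}$ where $\nu = (I - 1)(J - 1)$. Pearson’s $\chi^2$ test statistic is

$$U = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(X_{ij} - E_{ij})^2}{E_{ij}}, \qquad (15.13)$$

with $E_{ij} = X_{i\cdot}\,X_{\cdot j}/n$ as in Theorem 15.4. Asymptotically, under $H_0$, $U$ has a $\chi^2_{\nu}$ distribution where
$\nu = (I - 1)(J - 1)$.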



15.10 Example. These data are from Dunsmore et al. (1987). Patients with
Hodgkins disease are classified by their response to treatment and by histological
type.

    Type    Positive Response    Partial Response    No Response    Total
    LP             74                   18                12          104
    NS             68                   16                12           96
    MC            154                   54                58          266
    LD             18                   10                44           72

The $\chi^2$ test statistic is 75.89 with $2 \times 3 = 6$ degrees of freedom. The p-value
is $\mathbb{P}(\chi^2_6 > 75.89) \approx 0$. The likelihood ratio test statistic is 68.30 with $2 \times 3 = 6$
degrees of freedom. The p-value is $\mathbb{P}(\chi^2_6 > 68.30) \approx 0$. Thus there is strong
evidence that response to treatment and histological type are associated. ■
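The same computation extends to any $I \times J$ table. The sketch below is one way to organize it with the standard library only; the function names are illustrative. Because the standard library has no $\chi^2$ distribution, the tail probability is computed with the closed form $\mathbb{P}(\chi^2_{2m} > x) = e^{-x/2} \sum_{k=0}^{m-1} (x/2)^k / k!$, which holds only when the degrees of freedom are even; that restriction is an assumption of this sketch, not of the test.

```python
import math


def independence_tests(table):
    """Return (T, U, df) for an I x J table of counts (list of rows)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    T = 0.0
    U = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if observed > 0:  # an empty cell contributes 0 to T
                T += 2 * observed * math.log(observed / expected)
            U += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return T, U, df


def chi2_tail_even_df(x, df):
    """P(chi^2_df > x) for even df, using the closed-form finite sum."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Data from Example 15.10 (rows LP, NS, MC, LD).
hodgkins = [[74, 18, 12], [68, 16, 12], [154, 54, 58], [18, 10, 44]]
T, U, df = independence_tests(hodgkins)
p_T = chi2_tail_even_df(T, df)
p_U = chi2_tail_even_df(U, df)
```

For odd degrees of freedom one would need a numerical routine for the incomplete gamma function, for example from a statistics library.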



15.3 Two Continuous Variables 

Now suppose that $Y$ and $Z$ are both continuous. If we assume that the joint
distribution of $Y$ and $Z$ is bivariate Normal, then we measure the dependence
between $Y$ and $Z$ by means of the correlation coefficient $\rho$. Tests, estimates,
and confidence intervals for $\rho$ in the Normal case are given in the previous
chapter in Section 14.2. If we do not assume Normality then we can still use the
methods in Section 14.2 to draw inferences about the correlation $\rho$. However,
if we conclude that $\rho$ is 0, we cannot conclude that $Y$ and $Z$ are independent,
only that they are uncorrelated. Fortunately, the reverse direction is valid:
if we conclude that $Y$ and $Z$ are correlated then we can conclude they are
dependent.



15.4 One Continuous Variable and One Discrete 

Suppose that $Y \in \{1, \ldots, I\}$ is discrete and $Z$ is continuous. Let $F_i(z) =
\mathbb{P}(Z \le z \mid Y = i)$ denote the CDF of $Z$ conditional on $Y = i$.

15.11 Theorem. When $Y \in \{1, \ldots, I\}$ is discrete and $Z$ is continuous, then
$Y \perp\!\!\!\perp Z$ if and only if $F_1 = \cdots = F_I$.

It follows from the previous theorem that to test for independence, we need
to test

$$H_0 : F_1 = \cdots = F_I \quad \text{versus} \quad H_1 : \text{not } H_0.$$

For simplicity, we consider the case where $I = 2$. To test the null hypothesis
that $F_1 = F_2$ we will use the two sample Kolmogorov-Smirnov test. Let
$n_1$ denote the number of observations for which $Y_i = 1$ and let $n_2$ denote the
number of observations for which $Y_i = 2$. Let

$$\hat{F}_1(z) = \frac{1}{n_1} \sum_{i=1}^{n} I(Z_i \le z)\,I(Y_i = 1)$$

and

$$\hat{F}_2(z) = \frac{1}{n_2} \sum_{i=1}^{n} I(Z_i \le z)\,I(Y_i = 2)$$

denote the empirical distribution function of $Z$ given $Y = 1$ and $Y = 2$
respectively. Define the test statistic

$$D = \sup_x \left|\hat{F}_1(x) - \hat{F}_2(x)\right|.$$

15.12 Theorem. Let

$$H(t) = 1 - 2 \sum_{j=1}^{\infty} (-1)^{j-1} e^{-2 j^2 t^2}.$$

Under the null hypothesis that $F_1 = F_2$,

$$\lim_{n \to \infty} \mathbb{P}\left(\sqrt{\frac{n_1 n_2}{n_1 + n_2}}\; D \le t\right) = H(t). \qquad (15.14)$$

It follows from the theorem that an approximate level $\alpha$ test is obtained by
rejecting $H_0$ when

$$\sqrt{\frac{n_1 n_2}{n_1 + n_2}}\; D > H^{-1}(1 - \alpha).$$
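The two-sample Kolmogorov-Smirnov test above can be sketched in a few lines. This is an illustrative implementation, not a library routine: the function names are hypothetical, the supremum is found by checking the pooled data points (the difference of two step functions only changes there), and the infinite series for $H(t)$ is truncated once its terms fall below a small tolerance.

```python
import math
from bisect import bisect_right


def ks_two_sample(z1, z2):
    """Return D = sup_x |F1_hat(x) - F2_hat(x)| for two samples."""
    s1, s2 = sorted(z1), sorted(z2)
    n1, n2 = len(s1), len(s2)
    d = 0.0
    for v in s1 + s2:
        f1 = bisect_right(s1, v) / n1  # fraction of sample 1 that is <= v
        f2 = bisect_right(s2, v) / n2  # fraction of sample 2 that is <= v
        d = max(d, abs(f1 - f2))
    return d


def kolmogorov_cdf(t, tol=1e-16):
    """H(t) = 1 - 2 * sum_{j>=1} (-1)^(j-1) exp(-2 j^2 t^2), truncated."""
    if t <= 0:
        return 0.0
    total = 0.0
    j = 1
    while True:
        term = math.exp(-2 * j * j * t * t)
        total += (-1) ** (j - 1) * term
        if term < tol:
            break
        j += 1
    return min(1.0, max(0.0, 1 - 2 * total))


def ks_test(z1, z2):
    """Return (D, approximate p-value) using the limit in (15.14)."""
    n1, n2 = len(z1), len(z2)
    d = ks_two_sample(z1, z2)
    scaled = math.sqrt(n1 * n2 / (n1 + n2)) * d
    return d, 1 - kolmogorov_cdf(scaled)
```

To use it, split the $Z_i$ values into the group with $Y_i = 1$ and the group with $Y_i = 2$ and call `ks_test` on the two lists. The p-value relies on the large-sample limit, so it is only approximate for small $n_1$ and $n_2$.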



15.5 Appendix 

Interpreting The Odds Ratios. Suppose event $A$ has probability $\mathbb{P}(A)$.
The odds of $A$ are defined as $\text{odds}(A) = \mathbb{P}(A)/(1 - \mathbb{P}(A))$. It follows that
$\mathbb{P}(A) = \text{odds}(A)/(1 + \text{odds}(A))$. Let $E$ be the event that someone is exposed
to something (smoking, radiation, etc.) and let $D$ be the event that they get
a disease. The odds of getting the disease given that you are exposed are:

$$\text{odds}(D|E) = \frac{\mathbb{P}(D|E)}{1 - \mathbb{P}(D|E)}$$

and the odds of getting the disease given that you are not exposed are:

$$\text{odds}(D|E^c) = \frac{\mathbb{P}(D|E^c)}{1 - \mathbb{P}(D|E^c)}.$$

The odds ratio is defined to be

$$\psi = \frac{\text{odds}(D|E)}{\text{odds}(D|E^c)}.$$

If $\psi = 1$ then disease probability is the same for exposed and unexposed. This
implies that these events are independent. Recall that the log-odds ratio is
defined as $\gamma = \log(\psi)$. Independence corresponds to $\gamma = 0$.

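Consider this table of probabilities and corresponding table of data:

             $D^c$            $D$                                    $D^c$            $D$
    $E^c$    $p_{00}$         $p_{01}$         $p_{0\cdot}$         $E^c$    $X_{00}$         $X_{01}$         $X_{0\cdot}$
    $E$      $p_{10}$         $p_{11}$         $p_{1\cdot}$         $E$      $X_{10}$         $X_{11}$         $X_{1\cdot}$
             $p_{\cdot 0}$    $p_{\cdot 1}$    1                             $X_{\cdot 0}$    $X_{\cdot 1}$    $X_{\cdot\cdot}$

Now

$$\mathbb{P}(D|E) = \frac{p_{11}}{p_{10} + p_{11}} \quad \text{and} \quad \mathbb{P}(D|E^c) = \frac{p_{01}}{p_{00} + p_{01}},$$

and so

$$\text{odds}(D|E) = \frac{p_{11}}{p_{10}} \quad \text{and} \quad \text{odds}(D|E^c) = \frac{p_{01}}{p_{00}},$$

and therefore,

$$\psi = \frac{p_{11}\,p_{00}}{p_{01}\,p_{10}}.$$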



To estimate the parameters, we have to first consider how the data were
collected. There are three methods.

Multinomial Sampling. We draw a sample from the population and,
for each person, record their exposure and disease status. In this case, $X =
(X_{00}, X_{01}, X_{10}, X_{11}) \sim \text{Multinomial}(n, p)$. We then estimate the probabilities
in the table by $\hat{p}_{ij} = X_{ij}/n$ and

$$\hat{\psi} = \frac{\hat{p}_{11}\,\hat{p}_{00}}{\hat{p}_{01}\,\hat{p}_{10}} = \frac{X_{11}\,X_{00}}{X_{01}\,X_{10}}.$$

Prospective Sampling. (Cohort Sampling). We get some exposed and
unexposed people and count the number with disease in each group. Thus,

$$X_{01} \sim \text{Binomial}(X_{0\cdot},\, \mathbb{P}(D|E^c))$$

$$X_{11} \sim \text{Binomial}(X_{1\cdot},\, \mathbb{P}(D|E)).$$

We should really write $x_{0\cdot}$ and $x_{1\cdot}$ instead of $X_{0\cdot}$ and $X_{1\cdot}$ since in this case,
these are fixed not random, but for notational simplicity I’ll keep using capital
letters. We can estimate $\mathbb{P}(D|E)$ and $\mathbb{P}(D|E^c)$ but we cannot estimate all the
probabilities in the table. Still, we can estimate $\psi$ since $\psi$ is a function of
$\mathbb{P}(D|E)$ and $\mathbb{P}(D|E^c)$. Now

$$\hat{\mathbb{P}}(D|E) = \frac{X_{11}}{X_{1\cdot}} \quad \text{and} \quad \hat{\mathbb{P}}(D|E^c) = \frac{X_{01}}{X_{0\cdot}}.$$

Thus,

$$\hat{\psi} = \frac{X_{11}\,X_{00}}{X_{01}\,X_{10}}$$

just as before.

Case-Control (Retrospective) Sampling. Here we get some diseased
and non-diseased people and we observe how many are exposed. This is much
more efficient if the disease is rare. Hence,

$$X_{10} \sim \text{Binomial}(X_{\cdot 0},\, \mathbb{P}(E|D^c))$$

$$X_{11} \sim \text{Binomial}(X_{\cdot 1},\, \mathbb{P}(E|D)).$$

From these data we can estimate $\mathbb{P}(E|D)$ and $\mathbb{P}(E|D^c)$. Surprisingly, we can
also still estimate $\psi$. To understand why, note that

$$\mathbb{P}(E|D) = \frac{p_{11}}{p_{01} + p_{11}}, \qquad 1 - \mathbb{P}(E|D) = \frac{p_{01}}{p_{01} + p_{11}}, \qquad \text{odds}(E|D) = \frac{p_{11}}{p_{01}}.$$

By a similar argument,

$$\text{odds}(E|D^c) = \frac{p_{10}}{p_{00}}.$$

Hence,

$$\frac{\text{odds}(E|D)}{\text{odds}(E|D^c)} = \frac{p_{11}\,p_{00}}{p_{01}\,p_{10}} = \psi.$$

From the data, we form the following estimates:

$$\hat{\mathbb{P}}(E|D) = \frac{X_{11}}{X_{\cdot 1}}, \qquad 1 - \hat{\mathbb{P}}(E|D) = \frac{X_{01}}{X_{\cdot 1}}, \qquad \widehat{\text{odds}}(E|D) = \frac{X_{11}}{X_{01}}, \qquad \widehat{\text{odds}}(E|D^c) = \frac{X_{10}}{X_{00}}.$$

Therefore,

$$\hat{\psi} = \frac{X_{00}\,X_{11}}{X_{01}\,X_{10}}.$$

So in all three data collection methods, the estimate of $\psi$ turns out to be the
same.

It is tempting to try to estimate $\mathbb{P}(D|E) - \mathbb{P}(D|E^c)$. In a case-control design,
this quantity is not estimable. To see this, we apply Bayes’ theorem to get

$$\mathbb{P}(D|E) - \mathbb{P}(D|E^c) = \frac{\mathbb{P}(E|D)\,\mathbb{P}(D)}{\mathbb{P}(E)} - \frac{\mathbb{P}(E^c|D)\,\mathbb{P}(D)}{\mathbb{P}(E^c)}.$$

Because of the way we obtained the data, $\mathbb{P}(D)$ is not estimable from the data.
However, we can estimate $\xi = \mathbb{P}(D|E)/\mathbb{P}(D|E^c)$, which is called the relative
risk, under the rare disease assumption.

15.13 Theorem. Let $\xi = \mathbb{P}(D|E)/\mathbb{P}(D|E^c)$. Then

$$\frac{\psi}{\xi} \to 1$$

as $\mathbb{P}(D) \to 0$.

Thus, under the rare disease assumption, the relative risk is approximately
the same as the odds ratio and, as we have seen, we can estimate the odds
ratio.
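A small numerical check makes the rare disease approximation concrete. The sketch below holds the relative risk fixed at a hypothetical value of 2 and shrinks the baseline disease probability $\mathbb{P}(D|E^c)$; the ratio $\psi/\xi$ then approaches 1. All numbers are made up for illustration.

```python
def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)


def odds_ratio_and_relative_risk(p_d_given_e, p_d_given_not_e):
    """Return (psi, xi) for the two conditional disease probabilities."""
    psi = odds(p_d_given_e) / odds(p_d_given_not_e)
    xi = p_d_given_e / p_d_given_not_e
    return psi, xi


relative_risk = 2.0  # hypothetical value of xi, held fixed
ratios = []
for baseline in (0.1, 0.01, 0.001, 0.0001):
    psi, xi = odds_ratio_and_relative_risk(relative_risk * baseline, baseline)
    ratios.append(psi / xi)
    print(f"P(D|E^c) = {baseline:<7} psi = {psi:.4f}  xi = {xi:.4f}  psi/xi = {psi / xi:.4f}")
```

When the disease is common the odds ratio overstates the relative risk noticeably, which is why the approximation should only be invoked for rare outcomes.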



15.6 Exercises 

1. Prove Theorem 15.2. 

2. Prove Theorem 15.3. 

3. Prove Theorem 15.6. 

4. The New York Times (January 8, 2003, page A12) reported the following
data on death sentencing and race, from a study in Maryland:¹

                    Death Sentence    No Death Sentence
    Black Victim          14                 641
    White Victim          62                 594

Analyze the data using the tools from this chapter. Interpret the results.
Explain why, based only on this information, you can’t make causal
conclusions. (The authors of the study did use much more information
in their full report.)

¹The data here are an approximate re-creation using the information in the article.

5. Analyze the data on the variables Age and Financial Status from:
http://lib.stat.cmu.edu/DASL/Datafiles/montanadat.html

6. Estimate the correlation between temperature and latitude using the
data from

http://lib.stat.cmu.edu/DASL/Datafiles/USTemperatures.html

Use the correlation coefficient. Provide estimates, tests, and confidence
intervals.

7. Test whether calcium intake and drop in blood pressure are associated.
Use the data in

http://lib.stat.cmu.edu/DASL/Datafiles/Calcium.html




16 

Causal Inference 



Roughly speaking, the statement “$X$ causes $Y$” means that changing the
value of $X$ will change the distribution of $Y$. When $X$ causes $Y$, $X$ and $Y$
will be associated but the reverse is not, in general, true. Association does not
necessarily imply causation. We will consider two frameworks for discussing
causation. The first uses counterfactual random variables. The second,
presented in the next chapter, uses directed acyclic graphs.



16.1 The Counterfactual Model 

Suppose that $X$ is a binary treatment variable where $X = 1$ means “treated”
and $X = 0$ means “not treated.” We are using the word “treatment” in a
very broad sense. Treatment might refer to a medication or something like
smoking. An alternative to “treated/not treated” is “exposed/not exposed”
but we shall use the former.

Let $Y$ be some outcome variable such as presence or absence of disease.
To distinguish the statement “$X$ is associated with $Y$” from the statement “$X$
causes $Y$” we need to enrich our probabilistic vocabulary. Specifically, we will
decompose the response $Y$ into a more fine-grained object.

We introduce two new random variables $(C_0, C_1)$, called potential
outcomes, with the following interpretation: $C_0$ is the outcome if the subject is
not treated ($X = 0$) and $C_1$ is the outcome if the subject is treated ($X = 1$).
Hence,

$$Y = \begin{cases} C_0 & \text{if } X = 0 \\ C_1 & \text{if } X = 1. \end{cases}$$

We can express the relationship between $Y$ and $(C_0, C_1)$ more succinctly by

$$Y = C_X. \qquad (16.1)$$

Equation (16.1) is called the consistency relationship.

Here is a toy dataset to make the idea clear:

    X    Y    C_0    C_1
    0    4    4      *
    0    7    7      *
    0    2    2      *
    0    8    8      *
    1    3    *      3
    1    5    *      5
    1    8    *      8
    1    9    *      9

The asterisks denote unobserved values. When $X = 0$ we don’t observe $C_1$,
in which case we say that $C_1$ is a counterfactual since it is the outcome
you would have had if, counter to the fact, you had been treated ($X = 1$).
Similarly, when $X = 1$ we don’t observe $C_0$, and we say that $C_0$ is
counterfactual. There are four types of subjects:

    Type               C_0    C_1
    Survivors          1      1
    Responders         0      1
    Anti-responders    1      0
    Doomed             0      0

Think of the potential outcomes $(C_0, C_1)$ as hidden variables that contain all
the relevant information about the subject.

Define the average causal effect or average treatment effect to be

$$\theta = \mathbb{E}(C_1) - \mathbb{E}(C_0). \qquad (16.2)$$

The parameter $\theta$ has the following interpretation: $\theta$ is the mean if everyone
were treated ($X = 1$) minus the mean if everyone were not treated ($X = 0$).
There are other ways of measuring the causal effect. For example, if $C_0$ and
$C_1$ are binary, we define the causal odds ratio

$$\frac{\mathbb{P}(C_1 = 1)}{\mathbb{P}(C_1 = 0)} \div \frac{\mathbb{P}(C_0 = 1)}{\mathbb{P}(C_0 = 0)}$$

and the causal relative risk

$$\frac{\mathbb{P}(C_1 = 1)}{\mathbb{P}(C_0 = 1)}.$$

The main ideas will be the same whatever causal effect we use. For simplicity,
we shall work with the average causal effect $\theta$.

Define the association to be

$$\alpha = \mathbb{E}(Y|X = 1) - \mathbb{E}(Y|X = 0). \qquad (16.3)$$

Again, we could use odds ratios or other summaries if we wish.

16.1 Theorem (Association Is Not Causation). In general, $\theta \ne \alpha$.

16.2 Example. Suppose the whole population is as follows:

    X    Y    C_0    C_1
    0    0    0      0*
    0    0    0      0*
    0    0    0      0*
    0    0    0      0*
    1    1    1*     1
    1    1    1*     1
    1    1    1*     1
    1    1    1*     1

Again, the asterisks denote unobserved values. Notice that $C_0 = C_1$ for every
subject, thus, this treatment has no effect. Indeed,

$$\theta = \mathbb{E}(C_1) - \mathbb{E}(C_0) = \frac{1}{8}\sum_{i=1}^{8} C_{1i} - \frac{1}{8}\sum_{i=1}^{8} C_{0i} = \frac{4}{8} - \frac{4}{8} = 0.$$

Thus, the average causal effect is 0. The observed data are only the $X$’s and
$Y$’s, from which we can estimate the association:

$$\alpha = \mathbb{E}(Y|X = 1) - \mathbb{E}(Y|X = 0) = \frac{1 + 1 + 1 + 1}{4} - \frac{0 + 0 + 0 + 0}{4} = 1.$$

Hence, $\theta \ne \alpha$.

To add some intuition to this example, imagine that the outcome variable
is 1 if “healthy” and 0 if “sick”. Suppose that $X = 0$ means that the subject
does not take vitamin C and that $X = 1$ means that the subject does take
vitamin C. Vitamin C has no causal effect since $C_0 = C_1$ for each subject. In
this example there are two types of people: healthy people $(C_0, C_1) = (1, 1)$
and unhealthy people $(C_0, C_1) = (0, 0)$. Healthy people tend to take vitamin
C while unhealthy people don’t. It is this association between $(C_0, C_1)$ and
$X$ that creates an association between $X$ and $Y$. If we only had data on $X$
and $Y$ we would conclude that $X$ and $Y$ are associated. Suppose we wrongly
interpret this causally and conclude that vitamin C prevents illness. Next we
might encourage everyone to take vitamin C. If most people comply with our
advice, the population will look something like this:



    X    Y    C_0    C_1
    0    0    0      0*
    1    0    0*     0
    1    0    0*     0
    1    0    0*     0
    1    1    1*     1
    1    1    1*     1
    1    1    1*     1
    1    1    1*     1

Now $\alpha = (4/7) - (0/1) = 4/7$. We see that $\alpha$ went down from 1 to 4/7.
Of course, the causal effect never changed but the naive observer who does
not distinguish association and causation will be confused because his advice
seems to have made things worse instead of better. ■
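The arithmetic in Example 16.2 can be replayed in code. In the sketch below each subject is a hypothetical triple `(x, c0, c1)`; the causal effect uses both potential outcomes for everyone (which is only possible because this is a made-up population), while the association uses only what would actually be observed, namely $Y = C_X$.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def causal_effect(population):
    """theta = E(C1) - E(C0), using both potential outcomes of every subject."""
    return mean(c1 for _, _, c1 in population) - mean(c0 for _, c0, _ in population)


def association(population):
    """alpha = E(Y | X = 1) - E(Y | X = 0), where Y = C_X is the observed outcome."""
    treated = [c1 for x, _, c1 in population if x == 1]
    control = [c0 for x, c0, _ in population if x == 0]
    return mean(treated) - mean(control)


# Each subject is (x, c0, c1). First population of Example 16.2:
original = [(0, 0, 0)] * 4 + [(1, 1, 1)] * 4
# Population after most people follow the advice to take vitamin C:
after_advice = [(0, 0, 0)] + [(1, 0, 0)] * 3 + [(1, 1, 1)] * 4

theta_before, alpha_before = causal_effect(original), association(original)
theta_after, alpha_after = causal_effect(after_advice), association(after_advice)
```

In both populations `causal_effect` is 0, while `association` moves from 1 to 4/7, matching the example.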

In the last example, $\theta = 0$ and $\alpha = 1$. It is not hard to create examples in
which $\alpha > 0$ and yet $\theta < 0$. The fact that the association and causal effects
can have different signs is very confusing to many people.

The example makes it clear that, in general, we cannot use the association
to estimate the causal effect $\theta$. The reason that $\theta \ne \alpha$ is that $(C_0, C_1)$ was
not independent of $X$. That is, treatment assignment was not independent of
person type.

Can we ever estimate the causal effect? The answer is: sometimes. In
particular, random assignment to treatment makes it possible to estimate $\theta$.

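16.3 Theorem. Suppose we randomly assign subjects to treatment and that
$\mathbb{P}(X = 0) > 0$ and $\mathbb{P}(X = 1) > 0$. Then $\alpha = \theta$. Hence, any consistent
estimator of $\alpha$ is a consistent estimator of $\theta$. In particular, a consistent estimator
of $\theta$ is

$$\hat{\theta} = \hat{\mathbb{E}}(Y|X = 1) - \hat{\mathbb{E}}(Y|X = 0) = \bar{Y}_1 - \bar{Y}_0$$

where

$$\bar{Y}_1 = \frac{1}{n_1} \sum_{i=1}^{n} Y_i X_i, \qquad \bar{Y}_0 = \frac{1}{n_0} \sum_{i=1}^{n} Y_i (1 - X_i),$$

$n_1 = \sum_{i=1}^{n} X_i$, and $n_0 = \sum_{i=1}^{n} (1 - X_i)$.

Proof. Since $X$ is randomly assigned, $X$ is independent of $(C_0, C_1)$. Hence,

$$\theta = \mathbb{E}(C_1) - \mathbb{E}(C_0)$$

$$= \mathbb{E}(C_1|X = 1) - \mathbb{E}(C_0|X = 0) \quad \text{since } X \perp\!\!\!\perp (C_0, C_1)$$

$$= \mathbb{E}(Y|X = 1) - \mathbb{E}(Y|X = 0) \quad \text{since } Y = C_X$$

$$= \alpha.$$

The consistency follows from the law of large numbers. ■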
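A short simulation illustrates the theorem. The setup below is entirely hypothetical: each subject has a hidden type $u \sim \text{Uniform}(0,1)$ with potential outcomes $C_0 = u$ and $C_1 = u + 2$, so $\theta = 2$. Under random assignment the difference in sample means should be close to 2; when subjects with large $u$ select themselves into treatment, it is biased upward (toward 2.5 in this particular setup).

```python
import random


def difference_in_means(xs, ys):
    """Y_bar_1 - Y_bar_0 from observed treatments xs and outcomes ys."""
    treated = [y for x, y in zip(xs, ys) if x == 1]
    control = [y for x, y in zip(xs, ys) if x == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def simulate(n, randomized, seed=0):
    """Simulate n subjects and return the estimate Y_bar_1 - Y_bar_0.

    Hypothetical model: hidden type u ~ Uniform(0, 1), C0 = u, C1 = u + 2,
    so the true average causal effect is theta = 2.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        u = rng.random()
        c0, c1 = u, u + 2.0
        if randomized:
            x = 1 if rng.random() < 0.5 else 0  # coin flip, independent of (C0, C1)
        else:
            x = 1 if u > 0.5 else 0  # self-selection: X depends on the hidden type
        xs.append(x)
        ys.append(c1 if x == 1 else c0)  # consistency: Y = C_X
    return difference_in_means(xs, ys)


estimate_randomized = simulate(20000, randomized=True)
estimate_self_selected = simulate(20000, randomized=False)
```

The only difference between the two runs is whether $X$ is independent of $(C_0, C_1)$, which is exactly the condition used in the proof.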

If $Z$ is a covariate, we define the conditional causal effect by

$$\theta_z = \mathbb{E}(C_1|Z = z) - \mathbb{E}(C_0|Z = z).$$

For example, if $Z$ denotes gender with values $Z = 0$ (women) and $Z = 1$
(men), then $\theta_0$ is the causal effect among women and $\theta_1$ is the causal effect
among men. In a randomized experiment, $\theta_z = \mathbb{E}(Y|X = 1, Z = z) - \mathbb{E}(Y|X =
0, Z = z)$ and we can estimate the conditional causal effect using appropriate
sample averages.

Summary of the Counterfactual Model

Random variables: $(C_0, C_1, X, Y)$.

Consistency relationship: $Y = C_X$.

Causal Effect: $\theta = \mathbb{E}(C_1) - \mathbb{E}(C_0)$.

Association: $\alpha = \mathbb{E}(Y|X = 1) - \mathbb{E}(Y|X = 0)$.

Random assignment $\Longrightarrow (C_0, C_1) \perp\!\!\!\perp X \Longrightarrow \theta = \alpha$.



16.2 Beyond Binary Treatments 

Let us now generalize beyond the binary case. Suppose that $X \in \mathcal{X}$. For
example, $X$ could be the dose of a drug in which case $X \in \mathbb{R}$. The
counterfactual vector $(C_0, C_1)$ now becomes the counterfactual function $C(x)$ where
$C(x)$ is the outcome a subject would have if he received dose $x$. The observed
response is given by the consistency relation

$$Y = C(X). \qquad (16.4)$$

See Figure 16.1. The causal regression function is

$$\theta(x) = \mathbb{E}(C(x)). \qquad (16.5)$$

The regression function, which measures association, is $r(x) = \mathbb{E}(Y|X = x)$.

FIGURE 16.1. A counterfactual function $C(x)$. The outcome $Y$ is the value of the
curve $C(x)$ evaluated at the observed dose $X$.

16.4 Theorem. In general, $\theta(x) \ne r(x)$. However, when $X$ is randomly
assigned, $\theta(x) = r(x)$.

16.5 Example. An example in which $\theta(x)$ is constant but $r(x)$ is not constant
is shown in Figure 16.2. The figure shows the counterfactual functions for
four subjects. The dots represent their $X$ values $X_1, X_2, X_3, X_4$. Since $C_i(x)$
is constant over $x$ for all $i$, there is no causal effect and hence

$$\theta(x) = \frac{C_1(x) + C_2(x) + C_3(x) + C_4(x)}{4}$$

is constant. Changing the dose $x$ will not change anyone’s outcome. The four
dots in the lower plot represent the observed data points $Y_1 = C_1(X_1)$, $Y_2 =
C_2(X_2)$, $Y_3 = C_3(X_3)$, $Y_4 = C_4(X_4)$. The dotted line represents the regression
$r(x) = \mathbb{E}(Y|X = x)$. Although there is no causal effect, there is an association
since the regression curve $r(x)$ is not constant. ■



16.3 Observational Studies and Confounding 

A study in which treatment (or exposure) is not randomly assigned is called an
observational study. In these studies, subjects select their own value of the
exposure $X$. Many of the health studies you read about in the newspaper are
like this. As we saw, association and causation could in general be quite
different. This discrepancy occurs in non-randomized studies because the potential
outcome $C$ is not independent of treatment $X$. However, suppose we could
find groupings of subjects such that, within groups, $X$ and $\{C(x) : x \in \mathcal{X}\}$
are independent. This would happen if the subjects are very similar within
groups. For example, suppose we find people who are very similar in age,
gender, educational background, and ethnic background. Among these people we
might feel it is reasonable to assume that the choice of $X$ is essentially
random. These other variables are called confounding variables.¹ If we denote
these other variables collectively as $Z$, then we can express this idea by saying
that

$$\{C(x) : x \in \mathcal{X}\} \perp\!\!\!\perp X \mid Z. \qquad (16.6)$$

Equation (16.6) means that, within groups of $Z$, the choice of treatment $X$
does not depend on type, as represented by $\{C(x) : x \in \mathcal{X}\}$. If (16.6) holds
and we observe $Z$ then we say that there is no unmeasured confounding.

¹A more precise definition of confounding is given in the next chapter.

16.6 Theorem. Suppose that (16.6) holds. Then,

$$\theta(x) = \int \mathbb{E}(Y|X = x, Z = z)\, dF_Z(z). \qquad (16.7)$$

If $\hat{r}(x, z)$ is a consistent estimate of the regression function $\mathbb{E}(Y|X = x, Z =
z)$, then a consistent estimate of $\theta(x)$ is

$$\hat{\theta}(x) = \frac{1}{n} \sum_{i=1}^{n} \hat{r}(x, Z_i).$$

In particular, if $r(x, z) = \beta_0 + \beta_1 x + \beta_2 z$ is linear, then a consistent estimate
of $\theta(x)$ is

$$\hat{\theta}(x) = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 \bar{Z}_n \qquad (16.8)$$

where $(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2)$ are the least squares estimators.

FIGURE 16.2. The top plot shows the counterfactual function $C(x)$ for four
subjects. The dots represent their $X$ values. Since $C_i(x)$ is constant over $x$ for all $i$, there
is no causal effect. Changing the dose will not change anyone’s outcome. The lower
plot shows the causal regression function $\theta(x) = (C_1(x) + C_2(x) + C_3(x) + C_4(x))/4$.
The four dots represent the observed data points $Y_1 = C_1(X_1)$, $Y_2 = C_2(X_2)$,
$Y_3 = C_3(X_3)$, $Y_4 = C_4(X_4)$. The dotted line represents the regression
$r(x) = \mathbb{E}(Y|X = x)$. There is no causal effect since $C_i(x)$ is constant for all $i$.
But there is an association since the regression curve $r(x)$ is not constant.

16.7 Remark. It is useful to compare equation (16.7) to $\mathbb{E}(Y|X = x)$ which
can be written as $\mathbb{E}(Y|X = x) = \int \mathbb{E}(Y|X = x, Z = z)\, dF_{Z|X}(z|x)$.
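The plug-in estimator $\hat{\theta}(x) = n^{-1}\sum_i \hat{r}(x, Z_i)$ is simple to compute when $Z$ is discrete, because $\hat{r}(x, z)$ can be taken to be the sample mean of $Y$ in the cell $(X = x, Z = z)$. The sketch below simulates a hypothetical observational study in which (16.6) holds by construction: a binary confounder $Z$ raises both the chance of treatment and the outcome, and the true effect of $X$ on $Y$ is 1. All names and parameter values are assumptions made for illustration.

```python
import random


def cell_mean(data, x, z):
    """r_hat(x, z): sample mean of Y among records with X = x and Z = z."""
    ys = [yi for xi, zi, yi in data if xi == x and zi == z]
    return sum(ys) / len(ys)


def adjusted_mean(data, x):
    """theta_hat(x) = (1/n) * sum_i r_hat(x, Z_i)."""
    r_hat = {z: cell_mean(data, x, z) for z in {zi for _, zi, _ in data}}
    return sum(r_hat[zi] for _, zi, _ in data) / len(data)


def unadjusted_mean(data, x):
    """Sample version of E(Y | X = x), ignoring Z."""
    ys = [yi for xi, _, yi in data if xi == x]
    return sum(ys) / len(ys)


def simulate(n=20000, seed=1):
    """Hypothetical study: Z confounds the relationship between X and Y."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0
        x = 1 if rng.random() < (0.8 if z == 1 else 0.2) else 0
        y = 1.0 * x + 3.0 * z + rng.gauss(0, 1)  # true causal effect of X is 1
        data.append((x, z, y))
    return data


data = simulate()
adjusted = adjusted_mean(data, 1) - adjusted_mean(data, 0)
naive = unadjusted_mean(data, 1) - unadjusted_mean(data, 0)
```

In this setup the adjusted difference should be near 1 while the naive difference is near 2.8. The adjustment only works because $Z$ is the sole confounder in the simulation; with real data that is an untestable assumption.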

Epidemiologists call (16.7) the adjusted treatment effect. The process of 
computing adjusted treatment effects is called adjusting (or controlling) 
for confounding. The selection of what confounders Z to measure and con- 
trol for requires scientific insight. Even after adjusting for confounders, we 
cannot be sure that there are not other confounding variables that we missed. 
This is why observational studies must be treated with healthy skepticism. 
Results from observational studies start to become believable when: (i) the 
results are replicated in many studies, (ii) each of the studies controlled for 
plausible confounding variables, (iii) there is a plausible scientific explanation
for the existence of a causal relationship. 

A good example is smoking and cancer. Numerous studies have shown a 
relationship between smoking and cancer even after adjusting for many con- 
founding variables. Moreover, in laboratory studies, smoking has been shown 
to damage lung cells. Finally, a causal link between smoking and cancer has
been found in randomized animal studies. It is this collection of evidence 
over many years that makes this a convincing case. One single observational 
study is not, by itself, strong evidence. Remember that when you read the 
newspaper. 



16.4 Simpson’s Paradox 

Simpson’s paradox is a puzzling phenomenon that is discussed in most statis- 
tics texts. Unfortunately, most explanations are confusing (and in some cases 
incorrect). The reason is that it is nearly impossible to explain the paradox 
without using counterfactuals (or directed acyclic graphs). 

Let $X$ be a binary treatment variable, $Y$ a binary outcome, and $Z$ a third
binary variable such as gender. Suppose the joint distribution of $X$, $Y$, $Z$ is

                  Z = 1 (men)            Z = 0 (women)
               Y = 1      Y = 0       Y = 1      Y = 0
    X = 1      .1500      .2250       .1000      .0250
    X = 0      .0375      .0875       .2625      .1125

The marginal distribution for $(X, Y)$ is

               Y = 1      Y = 0
    X = 1       .25        .25        .50
    X = 0       .30        .20        .50
                .55        .45        1

From these tables we find that,

$$\mathbb{P}(Y = 1|X = 1) - \mathbb{P}(Y = 1|X = 0) = -0.1$$

$$\mathbb{P}(Y = 1|X = 1, Z = 1) - \mathbb{P}(Y = 1|X = 0, Z = 1) = 0.1$$

$$\mathbb{P}(Y = 1|X = 1, Z = 0) - \mathbb{P}(Y = 1|X = 0, Z = 0) = 0.1.$$

To summarize, we seem to have the following information:

    Mathematical Statement                                              English Statement?
    $\mathbb{P}(Y = 1|X = 1) < \mathbb{P}(Y = 1|X = 0)$                  treatment is harmful
    $\mathbb{P}(Y = 1|X = 1, Z = 1) > \mathbb{P}(Y = 1|X = 0, Z = 1)$    treatment is beneficial to men
    $\mathbb{P}(Y = 1|X = 1, Z = 0) > \mathbb{P}(Y = 1|X = 0, Z = 0)$    treatment is beneficial to women

Clearly, something is amiss. There can’t be a treatment which is good for
men, good for women, but bad overall. This is nonsense. The problem is with
the set of English statements in the table. Our translation from math into
English is specious.

The inequality $\mathbb{P}(Y = 1|X = 1) < \mathbb{P}(Y = 1|X = 0)$ does not
mean that treatment is harmful.

The phrase “treatment is harmful” should be written mathematically as
$\mathbb{P}(C_1 = 1) < \mathbb{P}(C_0 = 1)$. The phrase “treatment is harmful for men” should
be written $\mathbb{P}(C_1 = 1|Z = 1) < \mathbb{P}(C_0 = 1|Z = 1)$. The three mathematical
statements in the table are not at all contradictory. It is only the translation
into English that is wrong.

Let us now show that a real Simpson’s paradox cannot happen, that is,
there cannot be a treatment that is beneficial for men and women but harmful
overall. Suppose that treatment is beneficial for both sexes. Then

$$\mathbb{P}(C_1 = 1|Z = z) > \mathbb{P}(C_0 = 1|Z = z)$$

for all $z$. It then follows that

$$\mathbb{P}(C_1 = 1) = \sum_z \mathbb{P}(C_1 = 1|Z = z)\,\mathbb{P}(Z = z)$$

$$> \sum_z \mathbb{P}(C_0 = 1|Z = z)\,\mathbb{P}(Z = z)$$

$$= \mathbb{P}(C_0 = 1).$$

Hence, $\mathbb{P}(C_1 = 1) > \mathbb{P}(C_0 = 1)$, so treatment is beneficial overall. No paradox.
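The three differences quoted in this section can be verified directly from the joint distribution of $(X, Y, Z)$. The sketch below stores the eight cell probabilities in a dictionary and computes the conditional probabilities by summation; the helper names are illustrative. The last quantity averages the within-group differences over the distribution of $Z$, which equals $\mathbb{P}(C_1 = 1) - \mathbb{P}(C_0 = 1)$ only under the additional assumption that there is no unmeasured confounding given $Z$.

```python
# Joint distribution P(X = x, Y = y, Z = z), keyed by (x, y, z).
joint = {
    (1, 1, 1): 0.1500, (1, 0, 1): 0.2250,
    (0, 1, 1): 0.0375, (0, 0, 1): 0.0875,
    (1, 1, 0): 0.1000, (1, 0, 0): 0.0250,
    (0, 1, 0): 0.2625, (0, 0, 0): 0.1125,
}

NAMES = ("x", "y", "z")


def prob(joint, **fixed):
    """Total probability of the cells matching the given coordinates."""
    return sum(
        p for cell, p in joint.items()
        if all(cell[NAMES.index(name)] == value for name, value in fixed.items())
    )


def p_y1_given(joint, **cond):
    """P(Y = 1 | cond), where cond fixes x and optionally z."""
    return prob(joint, y=1, **cond) / prob(joint, **cond)


marginal_diff = p_y1_given(joint, x=1) - p_y1_given(joint, x=0)
men_diff = p_y1_given(joint, x=1, z=1) - p_y1_given(joint, x=0, z=1)
women_diff = p_y1_given(joint, x=1, z=0) - p_y1_given(joint, x=0, z=0)

# Average of the within-group differences, weighted by P(Z = z).
adjusted_diff = sum(
    (p_y1_given(joint, x=1, z=z) - p_y1_given(joint, x=0, z=z)) * prob(joint, z=z)
    for z in (0, 1)
)
```

The sign flip between `marginal_diff` and the two conditional differences arises because men are both more likely to be treated and less likely to have $Y = 1$.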

16.5 Bibliographic Remarks 

The use of potential outcomes to clarify causation is due mainly to Jerzy Ney- 
man and Donald Rubin. Later developments are due to Jamie Robins, Paul 
Rosenbaum, and others. A parallel development took place in econometrics 
by various people including James Heckman and Charles Manski. Texts on 
causation include Pearl (2000), Rosenbaum (2002), Spirtes et al. (2000), and 
van der Laan and Robins (2003). 



16.6 Exercises 

1. Create an example like Example 16.2 in which $\alpha > 0$ and $\theta < 0$.

2. Prove Theorem 16.4.

3. Suppose you are given data $(X_1, Y_1), \ldots, (X_n, Y_n)$ from an observational
study, where $X_i \in \{0, 1\}$ and $Y_i \in \{0, 1\}$. Although it is not possible
to estimate the causal effect $\theta$, it is possible to put bounds on $\theta$. Find
upper and lower bounds on $\theta$ that can be consistently estimated from
the data. Show that the bounds have width 1.

Hint: Note that $\mathbb{E}(C_1) = \mathbb{E}(C_1|X = 1)\,\mathbb{P}(X = 1) + \mathbb{E}(C_1|X = 0)\,\mathbb{P}(X = 0)$.

4. Suppose that $X \in \mathbb{R}$ and that, for each subject $i$, $C_i(x) = \beta_{1i} x$. Each
subject has their own slope $\beta_{1i}$. Construct a joint distribution on $(\beta_1, X)$
such that $\mathbb{P}(\beta_1 > 0) = 1$ but $\mathbb{E}(Y|X = x)$ is a decreasing function of $x$,
where $Y = C(X)$. Interpret.

5. Let $X \in \{0, 1\}$ be a binary treatment variable and let $(C_0, C_1)$ denote
the corresponding potential outcomes. Let $Y = C_X$ denote the observed
response. Let $F_0$ and $F_1$ be the cumulative distribution functions for
$C_0$ and $C_1$. Assume that $F_0$ and $F_1$ are both continuous and strictly
increasing. Let $\theta = m_1 - m_0$ where $m_0 = F_0^{-1}(1/2)$ is the median of $C_0$
and $m_1 = F_1^{-1}(1/2)$ is the median of $C_1$. Suppose that the treatment $X$
is assigned randomly. Find an expression for $\theta$ involving only the joint
distribution of $X$ and $Y$.




17 

Directed Graphs and Conditional 
Independence 



17.1 Introduction 

A directed graph consists of a set of nodes with arrows between some nodes.
An example is shown in Figure 17.1.

Graphs are useful for representing independence relations between variables.
They can also be used as an alternative to counterfactuals to represent causal
relationships. Some people use the phrase Bayesian network to refer to a
directed graph endowed with a probability distribution. This is a poor choice
of terminology. Statistical inference for directed graphs can be performed using
frequentist or Bayesian methods, so it is misleading to call them Bayesian
networks.

FIGURE 17.1. A directed graph with vertices $V = \{X, Y, Z\}$ and edges
$E = \{(Y, X), (Y, Z)\}$.

Before getting into details about directed acyclic graphs (DAGs), we need
to discuss conditional independence.

17.2 Conditional Independence 

17.1 Definition. Let X, Y and Z be random variables. X and Y are
conditionally independent given Z, written $X \amalg Y \mid Z$, if

$$f_{X,Y \mid Z}(x, y \mid z) = f_{X \mid Z}(x \mid z)\, f_{Y \mid Z}(y \mid z) \qquad (17.1)$$

for all x, y and z.

Intuitively, this means that, once you know Z, Y provides no extra
information about X. An equivalent definition is that

$$f(x \mid y, z) = f(x \mid z). \qquad (17.2)$$

The conditional independence relation satisfies some basic properties. 

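17.2 Theorem. The following implications hold:^1

$$
\begin{aligned}
X \amalg Y \mid Z \ &\Longrightarrow\ Y \amalg X \mid Z \\
X \amalg Y \mid Z \text{ and } U = h(X) \ &\Longrightarrow\ U \amalg Y \mid Z \\
X \amalg Y \mid Z \text{ and } U = h(X) \ &\Longrightarrow\ X \amalg Y \mid (Z, U) \\
X \amalg Y \mid Z \text{ and } X \amalg W \mid (Y, Z) \ &\Longrightarrow\ X \amalg (W, Y) \mid Z \\
X \amalg Y \mid Z \text{ and } X \amalg Z \mid Y \ &\Longrightarrow\ X \amalg (Y, Z).
\end{aligned}
$$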

17.3 DAGs 

A directed graph $\mathcal{G}$ consists of a set of vertices V and an edge set E of
ordered pairs of vertices. For our purposes, each vertex will correspond to a
random variable. If $(X, Y) \in E$ then there is an arrow pointing from X to Y.
See Figure 17.1.

^1 The last property requires the assumption that all events have positive probability; the first
four do not.

FIGURE 17.2. DAG for Example 17.4.

If an arrow connects two variables X and Y (in either direction) we say
that X and Y are adjacent. If there is an arrow from X to Y then X is a
parent of Y and Y is a child of X. The set of all parents of X is denoted
by $\pi_X$ or $\pi(X)$. A directed path between two variables is a set of arrows
all pointing in the same direction linking one variable to the other such as:

$$X \to \cdots \to Y$$

A sequence of adjacent vertices starting with X and ending with Y but
ignoring the direction of the arrows is called an undirected path. The
sequence {X, Y, Z} in Figure 17.1 is an undirected path. X is an ancestor of
Y if there is a directed path from X to Y (or X = Y). We also say that Y is
a descendant of X.

A configuration of the form:

$$X \to Y \leftarrow Z$$

is called a collider at Y. A configuration not of that form is called a
non-collider, for example,

$$X \to Y \to Z$$

or

$$X \leftarrow Y \to Z$$



The collider property is path dependent. In Figure 17.7, Y is a collider on
the path {X, Y, Z} but it is a non-collider on the path {X, Y, W}. When the
variables pointing into the collider are not adjacent, we say that the collider
is unshielded. A directed path that starts and ends at the same variable is
called a cycle. A directed graph is acyclic if it has no cycles. In this case we
say that the graph is a directed acyclic graph or DAG. From now on, we
only deal with acyclic graphs.
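The definitions above translate directly into a few lines of code. The following is a minimal sketch, not a library implementation: a directed graph is stored as a list of ordered pairs, and the helper names `parents`, `descendants`, and `is_acyclic` are our own. The acyclicity check uses Kahn's topological-sort idea, which is a standard technique and not something taken from this chapter. Following the convention in the text, a vertex counts as its own descendant.

```python
def parents(edges, v):
    """Vertices with an arrow pointing into v."""
    return {a for a, b in edges if b == v}


def descendants(edges, v):
    """Vertices reachable from v along a directed path (v itself included)."""
    seen, stack = {v}, [v]
    while stack:
        u = stack.pop()
        for a, b in edges:
            if a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def is_acyclic(vertices, edges):
    """True if the directed graph has no cycle (Kahn's algorithm)."""
    indegree = {v: 0 for v in vertices}
    for _, b in edges:
        indegree[b] += 1
    ready = [v for v in vertices if indegree[v] == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for a, b in edges:
            if a == u:
                indegree[b] -= 1
                if indegree[b] == 0:
                    ready.append(b)
    # Every vertex can be peeled off exactly when there is no cycle.
    return removed == len(vertices)


# The graph of Figure 17.1: V = {X, Y, Z}, E = {(Y, X), (Y, Z)}.
V = ["X", "Y", "Z"]
E = [("Y", "X"), ("Y", "Z")]
```

For the graph of Figure 17.1, `parents(E, "X")` returns `{"Y"}`, and adding the arrow `("X", "Y")` creates a cycle that `is_acyclic` detects.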



17.4 Probability and DAGs 

Let $\mathcal{G}$ be a DAG with vertices $V = (X_1, \ldots, X_k)$.

17.3 Definition. If P is a distribution for V with probability function f,
we say that P is Markov to $\mathcal{G}$, or that $\mathcal{G}$ represents P, if

$$f(v) = \prod_{i=1}^{k} f(x_i \mid \pi_i) \qquad (17.3)$$

where $\pi_i$ are the parents of $X_i$. The set of distributions represented by $\mathcal{G}$
is denoted by $M(\mathcal{G})$.



17.4 Example. Figure 17.2 shows a DAG with four variables. The probability
function for this example factors as

f(overweight, smoking, heart disease, cough)

    = f(overweight) × f(smoking)

    × f(heart disease | overweight, smoking)

    × f(cough | smoking). ■

17.5 Example. For the DAG in Figure 17.3, $P \in M(\mathcal{G})$ if and only if its
probability function f has the form

$$f(x, y, z, w) = f(x)\, f(y)\, f(z \mid x, y)\, f(w \mid z).$$

FIGURE 17.3. Another DAG.

The following theorem says that $P \in M(\mathcal{G})$ if and only if the Markov
Condition holds. Roughly speaking, the Markov Condition means that every
variable W is independent of the “past” given its parents.

17.6 Theorem. A distribution $P \in M(\mathcal{G})$ if and only if the following Markov
Condition holds: for every variable W,

$$W \amalg \widetilde{W} \mid \pi_W \qquad (17.4)$$

where $\widetilde{W}$ denotes all the other variables except the parents and descendants
of W.

17.7 Example. In Figure 17.3, the Markov Condition implies that

$$X \amalg Y \quad \text{and} \quad W \amalg \{X, Y\} \mid Z.$$ ■

17.8 Example. Consider the DAG in Figure 17.4. In this case the probability
function must factor like

$$f(a, b, c, d, e) = f(a)\, f(b \mid a)\, f(c \mid a)\, f(d \mid b, c)\, f(e \mid d).$$

The Markov Condition implies the following independence relations:

$$D \amalg A \mid \{B, C\}, \quad E \amalg \{A, B, C\} \mid D \quad \text{and} \quad B \amalg C \mid A.$$ ■
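To see the factorization of Example 17.8 at work, here is a small numerical sketch. It assumes all five variables are binary and uses arbitrary, made-up conditional probabilities; the helper names `bern`, `marginal`, and `max_ci_gap` are hypothetical. The code builds the joint probability function from the five factors and then checks two of the stated independence relations by brute force. Variables are indexed A = 0, B = 1, C = 2, D = 3, E = 4.

```python
from itertools import product


def bern(p, v):
    """Probability that a Bernoulli(p) variable equals v."""
    return p if v == 1 else 1.0 - p


def f(a, b, c, d, e):
    """f(a) f(b|a) f(c|a) f(d|b,c) f(e|d) with arbitrary illustrative numbers."""
    return (bern(0.3, a)
            * bern(0.8 if a else 0.1, b)
            * bern(0.6 if a else 0.2, c)
            * bern(0.1 + 0.4 * b + 0.4 * c, d)
            * bern(0.7 if d else 0.05, e))


joint = {v: f(*v) for v in product((0, 1), repeat=5)}


def marginal(joint, keep):
    """Marginal probability function of the coordinates listed in keep."""
    out = {}
    for v, p in joint.items():
        key = tuple(v[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def max_ci_gap(joint, i, j, cond):
    """Largest |f(xi, xj | c) - f(xi | c) f(xj | c)| over all values.

    A value of (numerically) zero means coordinate i is independent of
    coordinate j given the coordinates in cond.
    """
    pij = marginal(joint, [i, j] + cond)
    pi = marginal(joint, [i] + cond)
    pj = marginal(joint, [j] + cond)
    pc = marginal(joint, cond)
    gap = 0.0
    for key, p in pij.items():
        xi, xj, c = key[0], key[1], key[2:]
        lhs = p / pc[c]
        rhs = (pi[(xi,) + c] / pc[c]) * (pj[(xj,) + c] / pc[c])
        gap = max(gap, abs(lhs - rhs))
    return gap
```

With these numbers, `max_ci_gap(joint, 1, 2, [0])` and `max_ci_gap(joint, 3, 0, [1, 2])` are zero up to rounding, matching $B \amalg C \mid A$ and $D \amalg A \mid \{B, C\}$, while `max_ci_gap(joint, 1, 2, [])` is clearly positive: B and C share the parent A and are not marginally independent.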

17.5 More Independence Relations 

FIGURE 17.4. Yet another DAG.

The Markov Condition allows us to list some independence relations implied
by a DAG. These relations might imply other independence relations. Consider
the DAG in Figure 17.5. The Markov Condition implies:

$$X_1 \amalg X_2, \quad X_2 \amalg \{X_1, X_4\}, \quad X_3 \amalg X_4 \mid \{X_1, X_2\},$$

$$X_4 \amalg \{X_2, X_3\} \mid X_1, \quad X_5 \amalg \{X_1, X_2\} \mid \{X_3, X_4\}.$$

It turns out (but it is not obvious) that these conditions imply that

$$\{X_4, X_5\} \amalg X_2 \mid \{X_1, X_3\}.$$



How do we find these extra independence relations? The answer is
“d-separation,” which means “directed separation.” d-separation can be
summarized by three rules. Consider the four DAGs in Figure 17.6 and the DAG in
Figure 17.7. The first three DAGs in Figure 17.6 have no colliders. The DAG in
the lower right of Figure 17.6 has a collider. The DAG in Figure 17.7 has a
collider with a descendant.

FIGURE 17.5. And yet another DAG.

FIGURE 17.6. The first three DAGs have no colliders. The fourth DAG in the lower
right corner has a collider at Y.

FIGURE 17.7. A collider with a descendant.

FIGURE 17.8. d-separation explained.



The Rules of d-Separation 
Consider the DAGs in Figures 17.6 and 17.7. 

1. When Y is not a collider, X and Z are d-connected, but they are 

d-separated given Y. 

2. If X and Z collide at Y, then X and Z are d-separated, but they 

are d-connected given Y. 

3. Conditioning on the descendant of a collider has the same effect as 

conditioning on the collider. Thus in Figure 17.7, X and Z are 
d-separated but they are d-connected given W. 



Here is a more formal definition of d-separation. Let X and Y be distinct
vertices and let W be a set of vertices not containing X or Y. Then X and
Y are d-separated given W if there exists no undirected path U between
X and Y such that (i) every collider on U has a descendant in W, and (ii)
no other vertex on U is in W. If A, B, and W are distinct sets of vertices and
A and B are not empty, then A and B are d-separated given W if for every
$X \in A$ and $Y \in B$, X and Y are d-separated given W. Sets of vertices that
are not d-separated are said to be d-connected.
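The formal definition can be checked mechanically on a small graph. The sketch below enumerates every undirected path between two vertices and tests conditions (i) and (ii) directly; this is exponential in the size of the graph and is meant only as an illustration. The function names are our own. The edge list is an assumption: it is the DAG we infer for Figure 17.5 from the Markov relations listed earlier ($X_1 \to X_3$, $X_2 \to X_3$, $X_1 \to X_4$, $X_3 \to X_5$, $X_4 \to X_5$), since the figure itself is not reproduced here. As in the text, a vertex counts as its own descendant.

```python
def descendants(edges, v):
    """Vertices reachable from v along a directed path (v itself included)."""
    seen, stack = {v}, [v]
    while stack:
        u = stack.pop()
        for a, b in edges:
            if a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def d_separated(edges, x, y, w):
    """True if no undirected path between x and y is active given the set w."""
    w = set(w)
    edge_set = set(edges)
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)

    def active(path):
        for i in range(1, len(path) - 1):
            prev, mid, nxt = path[i - 1], path[i], path[i + 1]
            collider = (prev, mid) in edge_set and (nxt, mid) in edge_set
            if collider:
                # (i) a collider must have a descendant in w
                if not (descendants(edges, mid) & w):
                    return False
            elif mid in w:
                # (ii) no other vertex on the path may be in w
                return False
        return True

    def search(path):
        last = path[-1]
        if last == y:
            return active(path)
        return any(search(path + [n])
                   for n in nbrs.get(last, ()) if n not in path)

    return not search([x])


# Assumed edges for the DAG of Figure 17.5 (inferred from the Markov relations).
fig_17_5 = [("X1", "X3"), ("X2", "X3"), ("X1", "X4"),
            ("X3", "X5"), ("X4", "X5")]
```

Under that assumption, `d_separated(fig_17_5, "X2", "X4", ["X1", "X3"])` and `d_separated(fig_17_5, "X2", "X5", ["X1", "X3"])` both return `True`, in agreement with $\{X_4, X_5\} \amalg X_2 \mid \{X_1, X_3\}$, while conditioning on the collider's descendant $X_5$ d-connects $X_1$ and $X_2$.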

17.9 Example. Consider the DAG in Figure 17.8. From the d-separation rules
we conclude that:

X and Y are d-separated (given the empty set);

X and Y are d-connected given $\{S_1, S_2\}$;

X and Y are d-separated given $\{S_1, S_2, V\}$.

17.10 Theorem.^2 Let A, B, and C be disjoint sets of vertices. Then $A \amalg B \mid C$
if and only if A and B are d-separated by C.



^2 We implicitly assume that P is faithful to $\mathcal{G}$, which means that P has no extra independence
relations other than those logically implied by the Markov Condition.

FIGURE 17.9. Jordan’s alien example (Example 17.11). Was your friend kidnapped
by aliens or did you forget to set your watch?

17.11 Example. The fact that conditioning on a collider creates dependence 
might not seem intuitive. Here is a whimsical example from Jordan (2004) that 
makes this idea more palatable. Your friend appears to be late for a meeting 
with you. There are two explanations: she was abducted by aliens or you forgot 
to set your watch ahead one hour for daylight savings time. (See Figure 17.9.) 
Aliens and Watch are blocked by a collider which implies they are marginally 
independent. This seems reasonable since — before we know anything about 
your friend being late — we would expect these variables to be independent. 
We would also expect that $\mathbb{P}(\text{Aliens} = \text{yes} \mid \text{Late} = \text{yes}) > \mathbb{P}(\text{Aliens} = \text{yes})$;
learning that your friend is late certainly increases the probability that she
was abducted. But when we learn that you forgot to set your watch properly,
we would lower the chance that your friend was abducted. Hence,
$\mathbb{P}(\text{Aliens} = \text{yes} \mid \text{Late} = \text{yes}) \neq \mathbb{P}(\text{Aliens} = \text{yes} \mid \text{Late} = \text{yes}, \text{Watch} = \text{no})$. Thus, Aliens and
Watch are dependent given Late. ■
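A few made-up numbers make the alien example concrete. In the sketch below every probability is an assumption chosen purely for illustration: `aliens` is the event that your friend was abducted, `forgot` is the event that the watch was not set (Watch = no), the two are independent a priori, and `P_LATE` gives the chance that your friend appears late for each combination.

```python
from itertools import product

P_ALIENS = 0.01   # assumed prior probability of abduction
P_FORGOT = 0.10   # assumed prior probability that the watch was not set

# Assumed P(Late = yes | aliens, forgot); Late is the collider.
P_LATE = {(0, 0): 0.05, (1, 0): 0.95, (0, 1): 0.90, (1, 1): 0.99}


def joint(aliens, forgot, late):
    """f(aliens) f(forgot) f(late | aliens, forgot)."""
    pa = P_ALIENS if aliens else 1 - P_ALIENS
    pf = P_FORGOT if forgot else 1 - P_FORGOT
    pl = P_LATE[(aliens, forgot)] if late else 1 - P_LATE[(aliens, forgot)]
    return pa * pf * pl


def prob_aliens(**observed):
    """P(Aliens = yes | observed), where observed may fix forgot and/or late."""
    num = den = 0.0
    for a, f_, l in product((0, 1), repeat=3):
        values = {"forgot": f_, "late": l}
        if any(values[k] != v for k, v in observed.items()):
            continue
        p = joint(a, f_, l)
        den += p
        num += p * a
    return num / den


prior = prob_aliens()
given_late = prob_aliens(late=1)
given_late_and_forgot = prob_aliens(late=1, forgot=1)
```

With these numbers, learning that your friend is late raises the probability of abduction well above its prior, and then learning that the watch was not set pulls it most of the way back down. Conditioning on `forgot` alone leaves the prior unchanged, as marginal independence requires.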

17.12 Example. Consider the DAG in Figure 17.2. In this example, over- 
weight and smoking are marginally independent but they are dependent given 
heart disease. ■ 

Graphs that look different may actually imply the same independence
relations. If $\mathcal{G}$ is a DAG, we let $\mathcal{I}(\mathcal{G})$ denote all the independence statements
implied by $\mathcal{G}$. Two DAGs $\mathcal{G}_1$ and $\mathcal{G}_2$ for the same variables V are Markov
equivalent if $\mathcal{I}(\mathcal{G}_1) = \mathcal{I}(\mathcal{G}_2)$. Given a DAG $\mathcal{G}$, let skeleton($\mathcal{G}$) denote the
undirected graph obtained by replacing the arrows with undirected edges.

17.13 Theorem. Two DAGs $\mathcal{G}_1$ and $\mathcal{G}_2$ are Markov equivalent if and only if
(i) skeleton($\mathcal{G}_1$) = skeleton($\mathcal{G}_2$) and (ii) $\mathcal{G}_1$ and $\mathcal{G}_2$ have the same unshielded
colliders.
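The two conditions of Theorem 17.13 are easy to test in code. Here is a minimal sketch, with function names of our own choosing, that compares skeletons and unshielded colliders for DAGs given as lists of ordered pairs. The four small graphs are the three collider-free DAGs on X, Y, Z with Y in the middle, plus the collider at Y; we assume these are the DAGs drawn in Figure 17.6.

```python
def skeleton(edges):
    """Undirected version of the graph: each arrow becomes an unordered pair."""
    return {frozenset(e) for e in edges}


def unshielded_colliders(edges):
    """Triples a -> m <- b in which a and b are not adjacent."""
    skel = skeleton(edges)
    found = set()
    for a, m in edges:
        for b, m2 in edges:
            if m2 == m and a != b and frozenset((a, b)) not in skel:
                found.add((frozenset((a, b)), m))
    return found


def markov_equivalent(edges1, edges2):
    """Conditions (i) and (ii) of Theorem 17.13."""
    return (skeleton(edges1) == skeleton(edges2)
            and unshielded_colliders(edges1) == unshielded_colliders(edges2))


chain = [("X", "Y"), ("Y", "Z")]        # X -> Y -> Z
reverse = [("Z", "Y"), ("Y", "X")]      # X <- Y <- Z
fork = [("Y", "X"), ("Y", "Z")]         # X <- Y -> Z
collider = [("X", "Y"), ("Z", "Y")]     # X -> Y <- Z
```

All four graphs share the skeleton X, Y, Z in a line, so only the unshielded collider at Y sets the fourth graph apart from the other three.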

17.14 Example. The first three DAGs in Figure 17.6 are Markov equivalent. 
The DAG in the lower right of the Figure is not Markov equivalent to the 
others. ■

17.6 Estimation for DAGs

Two estimation questions arise in the context of DAGs. First, given a DAG
$\mathcal{G}$ and data $V_1, \ldots, V_n$ from a distribution f consistent with $\mathcal{G}$, how do we
estimate f? Second, given data $V_1, \ldots, V_n$, how do we estimate $\mathcal{G}$? The first
question is pure estimation while the second involves model selection. These
are very involved topics and are beyond the scope of this book. We will just
briefly mention the main ideas.

Typically, one uses some parametric model $f(x \mid \pi_x; \theta_x)$ for each conditional
density. The likelihood function is then

$$\mathcal{L}(\theta) = \prod_{i=1}^{n} f(V_i; \theta) = \prod_{i=1}^{n} \prod_{j=1}^{m} f(X_{ij} \mid \pi_j; \theta_j)$$

where $X_{ij}$ is the value of $X_j$ for the i-th data point and $\theta_j$ are the parameters for
the j-th conditional density. We can then estimate the parameters by maximum
likelihood.
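For discrete variables with an unrestricted conditional table for each vertex, the likelihood above is maximized by conditional relative frequencies. The sketch below assumes that setting; the representation (a dictionary mapping each variable to a tuple of its parents, and data as a list of dictionaries) and the names `fit_dag` and `log_likelihood` are our own choices, not part of the text.

```python
import math
from collections import Counter


def fit_dag(parents, data):
    """Maximum likelihood estimate of each f(x_j | parents of x_j).

    Returns cpt[var][(parent_values, value)] = estimated probability.
    """
    cpt = {}
    for var, pa in parents.items():
        pair_counts = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        parent_counts = Counter(tuple(row[p] for p in pa) for row in data)
        cpt[var] = {key: n / parent_counts[key[0]]
                    for key, n in pair_counts.items()}
    return cpt


def log_likelihood(parents, cpt, data):
    """Sum over data points i and variables j of log f(X_ij | parents)."""
    total = 0.0
    for row in data:
        for var, pa in parents.items():
            total += math.log(cpt[var][(tuple(row[p] for p in pa), row[var])])
    return total


# A toy DAG X -> Y with four made-up observations.
toy_parents = {"X": (), "Y": ("X",)}
toy_data = [{"X": 0, "Y": 0}, {"X": 0, "Y": 1},
            {"X": 1, "Y": 1}, {"X": 1, "Y": 1}]
toy_cpt = fit_dag(toy_parents, toy_data)
```

The fitted table can then be passed to `log_likelihood` to obtain the maximized log-likelihood, which is the quantity a model selection score such as AIC would penalize.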

To estimate the structure of the DAG itself, we could fit every possible DAG 
using maximum likelihood and use AIC (or some other method) to choose a 
DAG. However, there are many possible DAGs so you would need much data 
for such a method to be reliable. Also, searching through all possible DAGs 
is a serious computational challenge. Producing a valid, accurate confidence 
set for the DAG structure would require astronomical sample sizes. If prior 
information is available about part of the DAG structure, the computational 
and statistical problems are at least partly ameliorated. 
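To get a feeling for how many DAGs there are, one can count them. The sketch below uses Robinson's recurrence for the number of labeled DAGs on n vertices, a known combinatorial result that is not derived in this book: $a_n = \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k} 2^{k(n-k)} a_{n-k}$ with $a_0 = 1$. The function name `num_dags` is our own.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def num_dags(n):
    """Number of labeled DAGs on n vertices (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))


counts = [num_dags(n) for n in range(1, 6)]
```

The count grows super-exponentially in the number of variables, which is one way to see why exhaustive search over structures quickly becomes infeasible.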



17.7 Bibliographic Remarks 

There are a number of texts on DAGs including Edwards (1995) and Jordan 
(2004). The first use of DAGs for representing causal relationships was by 
Wright (1934). Modern treatments are contained in Spirtes et al. (2000) and 
Pearl (2000). Robins et al. (2003) discuss the problems with estimating causal 
structure from data. 



17.8 Appendix 

Causation Revisited. We discussed causation in Chapter 16 using the idea
of counterfactual random variables. A different approach to causation uses
DAGs. The two approaches are mathematically equivalent though they appear
to be quite different. In the DAG approach, the extra element is the idea of
intervention. Consider the DAG in Figure 17.10.

FIGURE 17.10. Conditioning versus intervening.

The probability function for a distribution consistent with this DAG has
the form $f(x, y, z) = f(x)\, f(y \mid x)\, f(z \mid x, y)$. The following is pseudocode for
generating from this distribution.



    For i = 1, ..., n :
        x_i  <-  p_X(x_i)
        y_i  <-  p_{Y|X}(y_i | x_i)
        z_i  <-  p_{Z|X,Y}(z_i | x_i, y_i)



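Suppose we repeat this code many times, yielding data $(x_1, y_1, z_1), \ldots, (x_n, y_n, z_n)$.
Among all the times that we observe Y = y, how often is Z = z? The answer
to this question is given by the conditional distribution of Z | Y. Specifically,

$$
\begin{aligned}
\mathbb{P}(Z = z \mid Y = y) &= \frac{\mathbb{P}(Y = y, Z = z)}{\mathbb{P}(Y = y)} = \frac{f(y, z)}{f(y)} \\
&= \frac{\sum_x f(x, y, z)}{f(y)} = \frac{\sum_x f(x)\, f(y \mid x)\, f(z \mid x, y)}{f(y)} \\
&= \sum_x f(z \mid x, y)\, \frac{f(y \mid x)\, f(x)}{f(y)} = \sum_x f(z \mid x, y)\, \frac{f(x, y)}{f(y)} \\
&= \sum_x f(z \mid x, y)\, f(x \mid y).
\end{aligned}
$$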



Now suppose we intervene by changing the computer code. Specifically,
suppose we fix Y at the value y. The code now looks like this:

    set Y = y
    for i = 1, ..., n :
        x_i  <-  p_X(x_i)
        z_i  <-  p_{Z|X,Y}(z_i | x_i, y)

Having set Y = y, how often was Z = z? To answer, note that the
intervention has changed the joint probability to be

$$f^*(x, z) = f(x)\, f(z \mid x, y).$$

The answer to our question is given by the marginal distribution

$$f^*(z) = \sum_x f^*(x, z) = \sum_x f(x)\, f(z \mid x, y).$$

We shall denote this as $\mathbb{P}(Z = z \mid Y := y)$ or $f(z \mid Y := y)$. We call
$\mathbb{P}(Z = z \mid Y = y)$ conditioning by observation or passive conditioning. We call
$\mathbb{P}(Z = z \mid Y := y)$ conditioning by intervention or active conditioning.
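The difference between the two kinds of conditioning is easy to see numerically. The following sketch assumes binary X, Y, Z and arbitrary, made-up conditional probabilities for the DAG of Figure 17.10. The functions `passive` and `active` evaluate the two formulas exactly, and `simulate` mimics the two pieces of pseudocode; all names are our own.

```python
import random


def f_x(x):
    """Assumed f(x): X is a fair coin."""
    return 0.5


def f_y_given_x(y, x):
    """Assumed f(y | x)."""
    p = 0.8 if x else 0.2
    return p if y else 1 - p


def f_z_given_xy(z, x, y):
    """Assumed f(z | x, y)."""
    p = 0.1 + 0.2 * y + 0.6 * x
    return p if z else 1 - p


def passive(z, y):
    """P(Z = z | Y = y) = sum_x f(z | x, y) f(x | y)."""
    num = sum(f_x(x) * f_y_given_x(y, x) * f_z_given_xy(z, x, y) for x in (0, 1))
    den = sum(f_x(x) * f_y_given_x(y, x) for x in (0, 1))
    return num / den


def active(z, y):
    """P(Z = z | Y := y) = sum_x f(x) f(z | x, y)."""
    return sum(f_x(x) * f_z_given_xy(z, x, y) for x in (0, 1))


def simulate(n, set_y=None, seed=0):
    """Draw n triples; if set_y is given, Y is fixed by intervention."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        x = int(rng.random() < f_x(1))
        if set_y is None:
            y = int(rng.random() < f_y_given_x(1, x))
        else:
            y = set_y
        z = int(rng.random() < f_z_given_xy(1, x, y))
        draws.append((x, y, z))
    return draws
```

In the observational simulation, the fraction of draws with Z = 1 among those with Y = 1 approximates `passive(1, 1)`; in the simulation with `set_y=1`, the overall fraction with Z = 1 approximates `active(1, 1)`. The two differ because observing Y = 1 also tells us something about X, whereas setting Y does not.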

Passive conditioning is used to answer a predictive question like:

“Given that Joe smokes, what is the probability he will get lung cancer?”

Active conditioning is used to answer a causal question like:

“If Joe quits smoking, what is the probability he will get lung cancer?”

Consider a pair $(\mathcal{G}, P)$ where $\mathcal{G}$ is a DAG and P is a distribution for the
variables V of the DAG. Let f denote the probability function for P.
Consider intervening and fixing a variable X to be equal to x. We represent the
intervention by doing two things:

(1) Create a new DAG $\mathcal{G}^*$ by removing all arrows pointing into X;

(2) Create a new distribution $f^*(v) = \mathbb{P}(V = v \mid X := x)$ by removing the
term $f(x \mid \pi_X)$ from $f(v)$.

The new pair $(\mathcal{G}^*, f^*)$ represents the intervention “set X = x.”

17.15 Example. You may have noticed a correlation between rain and having
a wet lawn, that is, the variable “Rain” is not independent of the variable “Wet
Lawn” and hence $p_{R,W}(r, w) \neq p_R(r)\, p_W(w)$ where R denotes Rain and W
denotes Wet Lawn. Consider the following two DAGs:

    Rain → Wet Lawn        Rain ← Wet Lawn.

The first DAG implies that $f(w, r) = f(r)\, f(w \mid r)$ while the second implies
that $f(w, r) = f(w)\, f(r \mid w)$. No matter what the joint distribution $f(w, r)$ is,
both graphs are correct. Both imply that R and W are not independent. But,
intuitively, if we want a graph to indicate causation, the first graph is right
and the second is wrong. Throwing water on your lawn doesn’t cause rain.
The reason we feel the first is correct while the second is wrong is because the
interventions implied by the first graph are correct.

Look at the first graph and form the intervention W = 1 where 1 denotes
“wet lawn.” Following the rules of intervention, we break the arrows into W
to get the modified graph:

    Rain        set Wet Lawn = 1

with distribution $f^*(r) = f(r)$. Thus $\mathbb{P}(R = r \mid W := w) = \mathbb{P}(R = r)$ tells us
that “wet lawn” does not cause rain.

Suppose we (wrongly) assume that the second graph is the correct causal 
graph and form the intervention W = 1 on the second graph. There are no
arrows into W that need to be broken so the intervention graph is the same 
as the original graph. Thus /*(r) = f{r\w) which would imply that changing 
“wet” changes “rain.” Clearly, this is nonsense. 

Both are correct probability graphs but only the first is correct causally. 
We know the correct causal graph by using background knowledge. 

17.16 Remark. We could try to learn the correct causal graph from data but 
this is dangerous. In fact it is impossible with two variables. With more than 
two variables there are methods that can find the causal graph under certain 
assumptions but they are large sample methods and, furthermore, there is no 
way to ever know if the sample size you have is large enough to make the 
methods reliable. 

We can use DAGs to represent confounding variables. If X is a treatment
and Y is an outcome, a confounding variable Z is a variable with arrows into
both X and Y; see Figure 17.11. It is easy to check, using the formalism of
interventions, that the following facts are true:

In a randomized study, the arrow between Z and X is broken. In this case,
even with Z unobserved (represented by enclosing Z in a circle), the causal
relationship between X and Y is estimable because it can be shown that
$\mathbb{E}(Y \mid X := x) = \mathbb{E}(Y \mid X = x)$ which does not involve the unobserved Z. In
an observational study, with all confounders observed, we get
$\mathbb{E}(Y \mid X := x) = \int \mathbb{E}(Y \mid X = x, Z = z)\, dF_Z(z)$ as in formula (16.7). If Z is unobserved then we
cannot estimate the causal effect because
$\mathbb{E}(Y \mid X := x) = \int \mathbb{E}(Y \mid X = x, Z = z)\, dF_Z(z)$ involves the unobserved Z. We can’t just use X and Y since in this
case, $\mathbb{P}(Y = y \mid X = x) \neq \mathbb{P}(Y = y \mid X := x)$ which is just another way of saying
that causation is not association.

FIGURE 17.11. Randomized study; Observational study with measured
confounders; Observational study with unmeasured confounders. The circled variables
are unobserved.

In fact, we can make a precise connection between DAGs and counterfactuals
as follows. Suppose that X and Y are binary. Define the confounding
variable Z by

$$
Z = \begin{cases}
1 & \text{if } (C_0, C_1) = (0, 0) \\
2 & \text{if } (C_0, C_1) = (0, 1) \\
3 & \text{if } (C_0, C_1) = (1, 0) \\
4 & \text{if } (C_0, C_1) = (1, 1).
\end{cases}
$$

From this, you can make the correspondence between the DAG approach and 
the counterfactual approach explicit. I leave this for the interested reader. 



17.9 Exercises 



1. Show that (17.1) and (17.2) are equivalent. 

2. Prove Theorem 17.2. 



3. Let X, Y and Z have the following joint distribution:

                   Z = 0                            Z = 1
             Y = 0    Y = 1                   Y = 0    Y = 1
    X = 0    .405     .045           X = 0    .125     .125
    X = 1    .045     .005           X = 1    .125     .125



(a) Find the conditional distribution of X and Y given Z = 0 and the 
conditional distribution of X and Y given Z = 1. 

(b) Show that $X \amalg Y \mid Z$.

(c) Find the marginal distribution of X and Y.

(d) Show that X and Y are not marginally independent.

4. Consider the three DAGs in Figure 17.6 without a collider. Prove that
$X \amalg Z \mid Y$.

FIGURE 17.12. DAG for exercise 7.

5. Consider the DAG in Figure 17.6 with a collider. Prove that $X \amalg Z$ and
that X and Z are dependent given Y.

6. Let $X \in \{0, 1\}$, $Y \in \{0, 1\}$, $Z \in \{0, 1, 2\}$. Suppose the distribution of
(X, Y, Z) is Markov to:

$$X \to Y \to Z$$

Create a joint distribution f(x, y, z) that is Markov to this DAG.
Generate 1000 random vectors from this distribution. Estimate the
distribution from the data using maximum likelihood. Compare the estimated
distribution to the true distribution. Let $\theta = (\theta_{000}, \theta_{001}, \ldots, \theta_{112})$ where
$\theta_{rst} = \mathbb{P}(X = r, Y = s, Z = t)$. Use the bootstrap to get standard errors
and 95 percent confidence intervals for these 12 parameters.

7. Consider the DAG in Figure 17.12.

(a) Write down the factorization of the joint density.

(b) Prove that $X \amalg Z_j$.

8. Let V = (X, Y, Z) have the following joint distribution

$$
\begin{aligned}
X &\sim \text{Bernoulli}\left(\frac{1}{2}\right) \\
Y \mid X = x &\sim \text{Bernoulli}\left(\frac{e^{4x-2}}{1 + e^{4x-2}}\right) \\
Z \mid X = x, Y = y &\sim \text{Bernoulli}\left(\frac{e^{2(x+y)-2}}{1 + e^{2(x+y)-2}}\right).
\end{aligned}
$$

(a) Find an expression for $\mathbb{P}(Z = z \mid Y = y)$. In particular, find
$\mathbb{P}(Z = 1 \mid Y = 1)$.

(b) Write a program to simulate the model. Conduct a simulation and
compute $\mathbb{P}(Z = 1 \mid Y = 1)$ empirically. Plot this as a function of
the simulation size N. It should converge to the theoretical value you
computed in (a).

(c) (Refers to material in the appendix.) Write down an expression for
$\mathbb{P}(Z = 1 \mid Y := y)$. In particular, find $\mathbb{P}(Z = 1 \mid Y := 1)$.

(d) (Refers to material in the appendix.) Modify your program to
simulate the intervention “set Y = 1.” Conduct a simulation and compute
$\mathbb{P}(Z = 1 \mid Y := 1)$ empirically. Plot this as a function of the simulation
size N. It should converge to the theoretical value you computed in (c).

9. This is a continuous, Gaussian version of the last question. Let V =
(X, Y, Z) have the following joint distribution

$$
\begin{aligned}
X &\sim \text{Normal}(0, 1) \\
Y \mid X = x &\sim \text{Normal}(\alpha x, 1) \\
Z \mid X = x, Y = y &\sim \text{Normal}(\beta y + \gamma x, 1).
\end{aligned}
$$

Here, $\alpha$, $\beta$ and $\gamma$ are fixed parameters. Economists refer to models like
this as structural equation models.

(a) Find an explicit expression for $f(z \mid y)$ and $\mathbb{E}(Z \mid Y = y) = \int z f(z \mid y)\, dz$.

(b) (Refers to material in the appendix.) Find an explicit expression
for $f(z \mid Y := y)$ and then find $\mathbb{E}(Z \mid Y := y) = \int z f(z \mid Y := y)\, dz$.
Compare to (a).

(c) Find the joint distribution of (Y, Z). Find the correlation $\rho$ between
Y and Z.

(d) (Refers to material in the appendix.) Suppose that X is not observed
and we try to make causal conclusions from the marginal distribution of
(Y, Z). (Think of X as unobserved confounding variables.) In particular,
suppose we declare that Y causes Z if $\rho \neq 0$ and we declare that Y does
not cause Z if $\rho = 0$. Show that this will lead to erroneous conclusions.

(e) (Refers to material in the appendix.) Suppose we conduct a
randomized experiment in which Y is randomly assigned. To be concrete,
suppose that

$$
\begin{aligned}
X &\sim \text{Normal}(0, 1) \\
Y &\sim \text{Normal}(\alpha, 1) \\
Z \mid X = x, Y = y &\sim \text{Normal}(\beta y + \gamma x, 1).
\end{aligned}
$$

Show that the method in (d) now yields correct conclusions (i.e., $\rho = 0$
if and only if $f(z \mid Y := y)$ does not depend on y).




18 

Undirected Graphs 



Undirected graphs are an alternative to directed graphs for representing in- 
dependence relations. Since both directed and undirected graphs are used in 
practice, it is a good idea to be facile with both. The main difference between 
the two is that the rules for reading independence relations from the graph 
are different. 



18.1 Undirected Graphs 

An undirected graph $\mathcal{G} = (V, E)$ has a finite set V of vertices (or nodes)
and a set E of edges (or arcs) consisting of pairs of vertices. The vertices
correspond to random variables X, Y, Z, ... and edges are written as unordered
pairs. For example, $(X, Y) \in E$ means that X and Y are joined by an edge.
An example of a graph is in Figure 18.1.

Two vertices are adjacent, written $X \sim Y$, if there is an edge between
them. In Figure 18.1, X and Y are adjacent but X and Z are not adjacent. A
sequence $X_0, \ldots, X_n$ is called a path if $X_{i-1} \sim X_i$ for each i. In Figure 18.1,
X, Y, Z is a path. A graph is complete if there is an edge between every pair
of vertices. A subset $U \subset V$ of vertices together with their edges is called a
subgraph.

FIGURE 18.1. A graph with vertices $V = \{X, Y, Z\}$. The edge set is
$E = \{(X, Y), (Y, Z)\}$.

FIGURE 18.2. {Y, W} and {Z} are separated by {X}. Also, W and Z are separated
by {X, Y}.

If A, B and C are three distinct subsets of V, we say that C separates
A and B if every path from a variable in A to a variable in B intersects a
variable in C. In Figure 18.2 {Y, W} and {Z} are separated by {X}. Also, W
and Z are separated by {X, Y}.



18.2 Probability and Graphs 

Let V be a set of random variables with distribution P. Construct a graph
with one vertex for each random variable in V. Omit the edge between a pair
of variables if they are independent given the rest of the variables:

$$\text{no edge between } X \text{ and } Y \iff X \amalg Y \mid \text{rest}$$

where “rest” refers to all the other variables besides X and Y. The resulting
graph is called a pairwise Markov graph. Some examples are shown in
Figures 18.3, 18.4, 18.5, and 18.6.

FIGURE 18.3. $X \amalg Z \mid Y$.

FIGURE 18.4. No implied independence relations.

The graph encodes a set of pairwise conditional independence relations. 
These relations imply other conditional independence relations. How can we 
figure out what they are? Fortunately, we can read these other conditional 
independence relations directly from the graph as well, as is explained in the 
next theorem. 

18.1 Theorem. Let $\mathcal{G} = (V, E)$ be a pairwise Markov graph for a distribution
P. Let A, B and C be distinct subsets of V such that C separates A and B.
Then $A \amalg B \mid C$.

18.2 Remark. If A and B are not connected (i.e., there is no path from A to
B) then we may regard A and B as being separated by the empty set. Then
Theorem 18.1 implies that $A \amalg B$.

FIGURE 18.5. $X \amalg Z \mid \{Y, W\}$ and $Y \amalg W \mid \{X, Z\}$.

FIGURE 18.6. Pairwise independence implies that $X \amalg Z \mid \{Y, W\}$. But is $X \amalg Z \mid Y$?



The independence condition in Theorem 18.1 is called the global Markov
property. We thus see that the pairwise and global Markov properties are
equivalent. Let us state this more precisely. Given a graph $\mathcal{G}$, let $M_{\text{pair}}(\mathcal{G})$
be the set of distributions which satisfy the pairwise Markov property: thus
$P \in M_{\text{pair}}(\mathcal{G})$ if, under P, $X \amalg Y \mid \text{rest}$ if and only if there is no edge between
X and Y. Let $M_{\text{global}}(\mathcal{G})$ be the set of distributions which satisfy the global
Markov property: thus $P \in M_{\text{global}}(\mathcal{G})$ if, under P, $A \amalg B \mid C$ if and only if C
separates A and B.

18.3 Theorem. Let $\mathcal{G}$ be a graph. Then, $M_{\text{pair}}(\mathcal{G}) = M_{\text{global}}(\mathcal{G})$.

Theorem 18.3 allows us to construct graphs using the simpler pairwise
property and then we can deduce other independence relations using the global
Markov property. Think how hard this would be to do algebraically. Returning
to Figure 18.6, we now see that $X \amalg Z \mid Y$ and $Y \amalg W \mid Z$.
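Reading independence relations off an undirected graph amounts to a reachability check: C separates A and B exactly when no vertex of B can be reached from A once the vertices in C are deleted. The sketch below does this with a breadth-first search; the function name `separates` is our own. The first edge list is the graph of Figure 18.1. The second is an assumption: we take Figure 18.6 to be the chain X, Y, Z, W, which is consistent with the relations quoted for it.

```python
from collections import deque


def separates(edges, a, b, c):
    """True if every path from a vertex in a to a vertex in b meets c."""
    a, b, c = set(a), set(b), set(c)
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    seen = a - c
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        if u in b:
            return False  # reached b without passing through c
        for v in nbrs.get(u, ()):
            if v not in seen and v not in c:
                seen.add(v)
                queue.append(v)
    return True


fig_18_1 = [("X", "Y"), ("Y", "Z")]
chain_xyzw = [("X", "Y"), ("Y", "Z"), ("Z", "W")]  # assumed shape of Figure 18.6
```

For the assumed chain, `separates(chain_xyzw, {"X"}, {"Z"}, {"Y"})` and `separates(chain_xyzw, {"Y"}, {"W"}, {"Z"})` are both `True`, which by the global Markov property gives $X \amalg Z \mid Y$ and $Y \amalg W \mid Z$.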

18.4 Example. Figure 18.7 implies that $X \amalg Y$, $X \amalg Z$ and $X \amalg (Y, Z)$. ■

18.5 Example. Figure 18.8 implies that $X \amalg W \mid (Y, Z)$ and $X \amalg Z \mid Y$. ■

FIGURE 18.7. $X \amalg Y$, $X \amalg Z$ and $X \amalg (Y, Z)$.

FIGURE 18.8. $X \amalg W \mid (Y, Z)$ and $X \amalg Z \mid Y$.

18.3 Cliques and Potentials 

A clique is a set of variables in a graph that are all adjacent to each other. A
set of variables is a maximal clique if it is a clique and if it is not possible
to include another variable and still be a clique. A potential is any positive
function. Under certain conditions, it can be shown that P is Markov to $\mathcal{G}$ if and
only if its probability function f can be written as

$$f(x) = \frac{\prod_{C \in \mathcal{C}} \psi_C(x_C)}{Z} \qquad (18.1)$$

where $\mathcal{C}$ is the set of maximal cliques and

$$Z = \sum_{x} \prod_{C \in \mathcal{C}} \psi_C(x_C).$$

18.6 Example. The maximal cliques for the graph in Figure 18.1 are $C_1 = \{X, Y\}$
and $C_2 = \{Y, Z\}$. Hence, if P is Markov to the graph, then its
probability function can be written

$$f(x, y, z) \propto \psi_1(x, y)\, \psi_2(y, z)$$

for some positive functions $\psi_1$ and $\psi_2$. ■

FIGURE 18.9. The maximal cliques of this graph are
$\{X_1, X_2\}, \{X_1, X_3\}, \{X_2, X_4\}, \{X_3, X_5\}, \{X_2, X_5, X_6\}$.
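Returning to Example 18.6, the following sketch builds a probability function from two potentials and confirms numerically that it satisfies $X \amalg Z \mid Y$, as the graph of Figure 18.1 requires. The variables are assumed binary and the potential values are arbitrary positive numbers chosen only for illustration; the helper name `max_gap_x_z_given_y` is our own.

```python
from itertools import product

# Arbitrary positive potentials on the two maximal cliques {X, Y} and {Y, Z}.
psi1 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 0.5}   # psi_1(x, y)
psi2 = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 2.0}   # psi_2(y, z)

states = list(product((0, 1), repeat=3))

# Normalizing constant Z: sum over x of the product of the potentials.
Z = sum(psi1[(x, y)] * psi2[(y, z)] for x, y, z in states)

# The probability function f(x, y, z) of equation (18.1).
f = {(x, y, z): psi1[(x, y)] * psi2[(y, z)] / Z for x, y, z in states}


def max_gap_x_z_given_y(f):
    """Largest |f(x, z | y) - f(x | y) f(z | y)| over all values."""
    gap = 0.0
    for y in (0, 1):
        fy = sum(f[(x, y, z)] for x in (0, 1) for z in (0, 1))
        for x in (0, 1):
            for z in (0, 1):
                fx = sum(f[(x, y, zz)] for zz in (0, 1)) / fy
                fz = sum(f[(xx, y, z)] for xx in (0, 1)) / fy
                gap = max(gap, abs(f[(x, y, z)] / fy - fx * fz))
    return gap
```

The gap is zero up to rounding whatever positive values are placed in `psi1` and `psi2`, because for each fixed y the product $\psi_1(x, y)\psi_2(y, z)$ splits into a function of x times a function of z.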

18.7 Example. The maximal cliques for the graph in Figure 18.9 are

$$\{X_1, X_2\}, \quad \{X_1, X_3\}, \quad \{X_2, X_4\}, \quad \{X_3, X_5\}, \quad \{X_2, X_5, X_6\}.$$

Thus we can write the probability function as

$$f(x_1, x_2, x_3, x_4, x_5, x_6) \propto \psi_{12}(x_1, x_2)\, \psi_{13}(x_1, x_3)\, \psi_{24}(x_2, x_4)\, \psi_{35}(x_3, x_5)\, \psi_{256}(x_2, x_5, x_6).$$ ■
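For larger graphs the maximal cliques can be enumerated by computer. The sketch below uses the Bron–Kerbosch algorithm (without pivoting), a standard method that is not discussed in this book. The edge list is an assumption: it contains exactly the pairs that appear together in one of the cliques listed in Example 18.7, with vertex j standing for $X_j$.

```python
def maximal_cliques(vertices, edges):
    """All maximal cliques of an undirected graph (Bron-Kerbosch, no pivoting)."""
    nbrs = {v: set() for v in vertices}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    cliques = []

    def expand(r, p, x):
        # r: current clique; p: candidates to extend it; x: already explored.
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & nbrs[v], x & nbrs[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(vertices), set())
    return set(cliques)


# Assumed edges of the graph in Figure 18.9.
fig_18_9_vertices = [1, 2, 3, 4, 5, 6]
fig_18_9_edges = [(1, 2), (1, 3), (2, 4), (3, 5), (2, 5), (2, 6), (5, 6)]
found = maximal_cliques(fig_18_9_vertices, fig_18_9_edges)
```

Each clique in `found` corresponds to one potential in the factorization of Example 18.7.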



18.4 Fitting Graphs to Data 

Given a data set, how do we find a graphical model that fits the data? As 
with directed graphs, this is a big topic that we will not treat here. However, 
in the discrete case, one way to fit a graph to data is to use a log-linear 
model, which is the subject of the next chapter. 



18.5 Bibliographic Remarks 

Thorough treatments of undirected graphs can be found in Whittaker (1990) 
and Lauritzen (1996). Some of the exercises below are from Whittaker (1990). 



18.6 Exercises 

1. Consider random variables $(X_1, X_2, X_3)$. In each of the following cases,
draw a graph that has the given independence relations.

(a) $X_1 \amalg X_3 \mid X_2$.

(b) $X_1 \amalg X_2 \mid X_3$ and $X_1 \amalg X_3 \mid X_2$.

(c) $X_1 \amalg X_2 \mid X_3$ and $X_1 \amalg X_3 \mid X_2$ and $X_2 \amalg X_3 \mid X_1$.

FIGURE 18.10.

FIGURE 18.11.

2. Consider random variables $(X_1, X_2, X_3, X_4)$. In each of the following
cases, draw a graph that has the given independence relations.

(a) $X_1 \amalg X_3 \mid X_2, X_4$ and $X_1 \amalg X_4 \mid X_2, X_3$ and $X_2 \amalg X_4 \mid X_1, X_3$.

(b) $X_1 \amalg X_2 \mid X_3, X_4$ and $X_1 \amalg X_3 \mid X_2, X_4$ and $X_2 \amalg X_3 \mid X_1, X_4$.

(c) $X_1 \amalg X_3 \mid X_2, X_4$ and $X_2 \amalg X_4 \mid X_1, X_3$.

3. A conditional independence between a pair of variables is minimal if it
is not possible to use the Separation Theorem to eliminate any variable
from the conditioning set, i.e. from the right hand side of the bar
(Whittaker, 1990). Write down the minimal conditional independencies from:
(a) Figure 18.10; (b) Figure 18.11; (c) Figure 18.12; (d) Figure 18.13.

4. Let $X_1, X_2, X_3$ be binary random variables. Construct the likelihood
ratio test for

$$H_0 : X_1 \amalg X_2 \mid X_3 \quad \text{versus} \quad H_1 : X_1 \text{ is not independent of } X_2 \mid X_3.$$

FIGURE 18.12.

FIGURE 18.13.

5. Here are breast cancer data from Morrison et al. (1973) on diagnostic
center ($X_1$), nuclear grade ($X_2$), and survival ($X_3$):

                      X2:   malignant   malignant   benign   benign
                      X3:   died        survived    died     survived
    X1   Boston             35          59          47       112
         Glamorgan          42          77          26       76

(a) Treat this as a multinomial and find the maximum likelihood esti- 
mator. 

(b) If someone has a tumor classified as benign at the Glamorgan clinic,
what is the estimated probability that they will die? Find the standard
error for this estimate.

(c) Test the following hypotheses:

$$
\begin{aligned}
X_1 \amalg X_2 \mid X_3 \quad &\text{versus} \quad X_1 \not\amalg X_2 \mid X_3 \\
X_1 \amalg X_3 \mid X_2 \quad &\text{versus} \quad X_1 \not\amalg X_3 \mid X_2 \\
X_2 \amalg X_3 \mid X_1 \quad &\text{versus} \quad X_2 \not\amalg X_3 \mid X_1
\end{aligned}
$$

Use the test from question 4. Based on the results of your tests, draw
and interpret the resulting graph.




19 

Log-Linear Models 



In this chapter we study log-linear models which are useful for modeling
multivariate discrete data. There is a strong connection between log-linear
models and undirected graphs.



19.1 The Log-Linear Model 

Let $X = (X_1, \ldots, X_m)$ be a discrete random vector with probability function

$$f(x) = \mathbb{P}(X = x) = \mathbb{P}(X_1 = x_1, \ldots, X_m = x_m)$$

where $x = (x_1, \ldots, x_m)$. Let $r_j$ be the number of values that $X_j$ takes. Without
loss of generality, we can assume that $X_j \in \{0, 1, \ldots, r_j - 1\}$. Suppose now
that we have n such random vectors. We can think of the data as a sample
from a Multinomial with $N = r_1 \times r_2 \times \cdots \times r_m$ categories. The data can be
represented as counts in a $r_1 \times r_2 \times \cdots \times r_m$ table. Let $p = (p_1, \ldots, p_N)$ denote
the multinomial parameter.

Let $S = \{1, \ldots, m\}$. Given a vector $x = (x_1, \ldots, x_m)$ and a subset $A \subset S$,
let $x_A = (x_j : j \in A)$. For example, if $A = \{1, 3\}$ then $x_A = (x_1, x_3)$.

19.1 Theorem. The joint probability function f(x) of a single random vector
$X = (X_1, \ldots, X_m)$ can be written as

$$\log f(x) = \sum_{A \subset S} \psi_A(x) \qquad (19.1)$$

where the sum is over all subsets A of $S = \{1, \ldots, m\}$ and the $\psi$'s satisfy the
following conditions:

1. $\psi_\emptyset(x)$ is a constant;

2. For every $A \subset S$, $\psi_A(x)$ is only a function of $x_A$ and not the rest of the
$x_j$'s.

3. If $i \in A$ and $x_i = 0$, then $\psi_A(x) = 0$.

The formula in equation (19.1) is called the log-linear expansion of f.
Each $\psi_A(x)$ may depend on some unknown parameters $\beta_A$. Let $\beta = (\beta_A : A \subset S)$
be the set of all these parameters. We will write $f(x) = f(x; \beta)$ when
we want to emphasize the dependence on the unknown parameters $\beta$.

In terms of the multinomial, the parameter space is

$$\mathcal{P} = \left\{ p = (p_1, \ldots, p_N) : \ p_j \ge 0, \ \sum_{j=1}^{N} p_j = 1 \right\}.$$

This is an N − 1 dimensional space. In the log-linear representation, the
parameter space is

$$\Theta = \left\{ \beta = (\beta_1, \ldots, \beta_N) : \ \beta = \beta(p), \ p \in \mathcal{P} \right\}$$

where $\beta(p)$ is the set of $\beta$ values associated with p. The set $\Theta$ is an N − 1
dimensional surface in $\mathbb{R}^N$. We can always go back and forth between the two
parameterizations: we can write $\beta = \beta(p)$ and $p = p(\beta)$.

19.2 Example. Let $X \sim \text{Bernoulli}(p)$ where $0 < p < 1$. We can write the
probability mass function for $X$ as

$$f(x) = p^x (1-p)^{1-x} = p_1^x \, p_2^{1-x}$$

for $x = 0, 1$, where $p_1 = p$ and $p_2 = 1 - p$. Hence,

$$\log f(x) = \psi_\emptyset(x) + \psi_1(x)$$

where

$$\psi_\emptyset(x) = \log(p_2), \qquad \psi_1(x) = x \log\left(\frac{p_1}{p_2}\right).$$

Notice that $\psi_\emptyset(x)$ is a constant (as a function of $x$) and $\psi_1(x) = 0$ when $x = 0$.
Thus the three conditions of Theorem 19.1 hold. The log-linear parameters
are

$$\beta_0 = \log(p_2), \qquad \beta_1 = \log\left(\frac{p_1}{p_2}\right).$$

The original, multinomial parameter space is $\mathcal{P} = \{(p_1, p_2) : p_j \ge 0, \ p_1 + p_2 = 1\}$.
The log-linear parameter space is

$$\Theta = \Big\{ (\beta_0, \beta_1) \in \mathbb{R}^2 : \ e^{\beta_0 + \beta_1} + e^{\beta_0} = 1 \Big\}.$$

Given $(p_1, p_2)$ we can solve for $(\beta_0, \beta_1)$. Conversely, given $(\beta_0, \beta_1)$ we can solve
for $(p_1, p_2)$. ■
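The back-and-forth between the two parameterizations in Example 19.2 can be checked numerically. The following is a minimal sketch; the function names `beta_from_p` and `p_from_beta` and the value 0.3 are illustrative choices, not part of the text.

```python
import math

def beta_from_p(p1):
    """Map the multinomial parameter (p1, p2 = 1 - p1) to (beta0, beta1)."""
    p2 = 1.0 - p1
    return math.log(p2), math.log(p1 / p2)

def p_from_beta(beta0, beta1):
    """Invert the map: f(1) = exp(beta0 + beta1) and f(0) = exp(beta0)."""
    return math.exp(beta0 + beta1), math.exp(beta0)

beta0, beta1 = beta_from_p(0.3)          # 0.3 is an arbitrary illustrative value
p1, p2 = p_from_beta(beta0, beta1)

# The constraint that defines the log-linear parameter space:
constraint = math.exp(beta0 + beta1) + math.exp(beta0)
```

Here `constraint` equals 1 up to rounding, which is the equation defining the log-linear parameter space in the example.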

19.3 Example. Let $X = (X_1, X_2)$ where $X_1 \in \{0, 1\}$ and $X_2 \in \{0, 1, 2\}$. The
joint distribution of $n$ such random vectors is a multinomial with 6 categories.
The multinomial parameters can be written as a 2-by-3 table as follows:

    multinomial      x_2 = 0     x_2 = 1     x_2 = 2
    x_1 = 0          p_00        p_01        p_02
    x_1 = 1          p_10        p_11        p_12

The $n$ data vectors can be summarized as counts:

    data             x_2 = 0     x_2 = 1     x_2 = 2
    x_1 = 0          C_00        C_01        C_02
    x_1 = 1          C_10        C_11        C_12

For $x = (x_1, x_2)$, the log-linear expansion takes the form

$$\log f(x) = \psi_\emptyset(x) + \psi_1(x) + \psi_2(x) + \psi_{12}(x)$$

where

$$\psi_\emptyset(x) = \log p_{00}$$

$$\psi_1(x) = x_1 \log\left(\frac{p_{10}}{p_{00}}\right)$$

$$\psi_2(x) = I(x_2 = 1) \log\left(\frac{p_{01}}{p_{00}}\right) + I(x_2 = 2) \log\left(\frac{p_{02}}{p_{00}}\right)$$

$$\psi_{12}(x) = I(x_1 = 1, x_2 = 1) \log\left(\frac{p_{11} p_{00}}{p_{01} p_{10}}\right) + I(x_1 = 1, x_2 = 2) \log\left(\frac{p_{12} p_{00}}{p_{02} p_{10}}\right).$$

Convince yourself that the three conditions on the $\psi$'s of the theorem are
satisfied. The six parameters of this model are:

$$\beta_1 = \log p_{00}, \qquad \beta_2 = \log\left(\frac{p_{10}}{p_{00}}\right), \qquad \beta_3 = \log\left(\frac{p_{01}}{p_{00}}\right),$$

$$\beta_4 = \log\left(\frac{p_{02}}{p_{00}}\right), \qquad \beta_5 = \log\left(\frac{p_{11} p_{00}}{p_{01} p_{10}}\right), \qquad \beta_6 = \log\left(\frac{p_{12} p_{00}}{p_{02} p_{10}}\right).$$ ■
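One way to convince yourself that the expansion in Example 19.3 is correct is to compute the four $\psi$-terms for a concrete table and confirm that they add up to $\log p_{x_1 x_2}$ in every cell. The sketch below assumes a made-up table of cell probabilities; the helper name `psi_terms` is hypothetical.

```python
import math

# Hypothetical cell probabilities p[x1][x2] for a 2-by-3 table (they sum to 1).
p = [[0.10, 0.20, 0.15],
     [0.05, 0.30, 0.20]]

def psi_terms(p, x1, x2):
    """Return (psi_empty, psi_1, psi_2, psi_12) evaluated at x = (x1, x2)."""
    psi_empty = math.log(p[0][0])
    psi_1 = math.log(p[x1][0] / p[0][0])     # equals 0 when x1 = 0
    psi_2 = math.log(p[0][x2] / p[0][0])     # equals 0 when x2 = 0
    # Interaction term: equals 0 when x1 = 0 or x2 = 0.
    psi_12 = math.log(p[x1][x2] * p[0][0] / (p[0][x2] * p[x1][0]))
    return psi_empty, psi_1, psi_2, psi_12

# Largest discrepancy between the sum of the psi-terms and log p over all cells.
max_error = max(
    abs(sum(psi_terms(p, x1, x2)) - math.log(p[x1][x2]))
    for x1 in range(2) for x2 in range(3)
)
```

Writing $\psi_1$ and $\psi_2$ as log-ratios against the baseline cell $p_{00}$ reproduces the indicator form used in the example, because the ratio is 1 whenever the corresponding coordinate is 0.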

The next theorem gives an easy way to check for conditional independence
in a log-linear model.

19.4 Theorem. Let $(X_a, X_b, X_c)$ be a partition of the vector $(X_1, \ldots, X_m)$.
Then $X_b \perp\!\!\!\perp X_c \mid X_a$ if and only if all the $\psi$-terms in the log-linear expansion
that have at least one coordinate in $b$ and one coordinate in $c$ are 0.

To prove this theorem, we will use the following lemma, whose proof follows
easily from the definition of conditional independence.

19.5 Lemma. A partition $(X_a, X_b, X_c)$ satisfies $X_b \perp\!\!\!\perp X_c \mid X_a$ if and only if
$f(x_a, x_b, x_c) = g(x_a, x_b)\, h(x_a, x_c)$ for some functions $g$ and $h$.

Proof. (Theorem 19.4.) Suppose that $\psi_t$ is 0 whenever $t$ has coordinates
in $b$ and $c$. Hence, $\psi_t$ can be nonzero only if $t \subset a \cup b$ or $t \subset a \cup c$. Therefore

$$\log f(x) = \sum_{t \subset a \cup b} \psi_t(x) + \sum_{t \subset a \cup c} \psi_t(x) - \sum_{t \subset a} \psi_t(x).$$

Exponentiating, we see that the joint density is of the form $g(x_a, x_b)\, h(x_a, x_c)$.
By Lemma 19.5, $X_b \perp\!\!\!\perp X_c \mid X_a$. The converse follows by reversing the argument. ■



19.2 Graphical Log-Linear Models 

A log-linear model is graphical if missing terms correspond only to
conditional independence constraints.

19.6 Definition. Let $\log f(x) = \sum_{A \subset S} \psi_A(x)$ be a log-linear model. Then
$f$ is graphical if all $\psi$-terms are nonzero except for any pair of
coordinates not in the edge set for some graph $\mathcal{G}$. In other words,
$\psi_A(x) = 0$ if and only if $\{i, j\} \subset A$ and $(i, j)$ is not an edge.



Here is a way to think about the definition above:
if you can add a term to the model and the graph does not change,
then the model is not graphical.

FIGURE 19.1. Graph for Example 19.7.

19.7 Example. Consider the graph in Figure 19.1. 

The graphical log-linear model that corresponds to this graph is

$$\log f(x) = \psi_\emptyset + \psi_1(x) + \psi_2(x) + \psi_3(x) + \psi_4(x) + \psi_5(x)$$
$$\qquad + \psi_{12}(x) + \psi_{23}(x) + \psi_{25}(x) + \psi_{34}(x) + \psi_{35}(x) + \psi_{45}(x) + \psi_{235}(x) + \psi_{345}(x).$$

Let’s see why this model is graphical. The edge (1,5) is missing in the graph.
Hence any term containing that pair of indices is omitted from the model. For
example,

$$\psi_{15}, \ \psi_{125}, \ \psi_{135}, \ \psi_{145}, \ \psi_{1235}, \ \psi_{1245}, \ \psi_{1345}, \ \psi_{12345}$$

are all omitted. Similarly, the edge (2,4) is missing and hence

$$\psi_{24}, \ \psi_{124}, \ \psi_{234}, \ \psi_{245}, \ \psi_{1234}, \ \psi_{1245}, \ \psi_{2345}, \ \psi_{12345}$$

are all omitted. There are other missing edges as well. You can check that the
model omits all the corresponding $\psi$ terms. Now consider the model

$$\log f(x) = \psi_\emptyset(x) + \psi_1(x) + \psi_2(x) + \psi_3(x) + \psi_4(x) + \psi_5(x)$$
$$\qquad + \psi_{12}(x) + \psi_{23}(x) + \psi_{25}(x) + \psi_{34}(x) + \psi_{35}(x) + \psi_{45}(x).$$

This is the same model except that the three-way interactions were removed.
If we draw a graph for this model, we will get the same graph. For example,
no $\psi$ terms contain (1,5) so we omit the edge between $X_1$ and $X_5$. But this is
not graphical since it has extra terms omitted. The independencies and graphs
for the two models are the same but the latter model has other constraints
besides conditional independence constraints. This is not a bad thing. It just
means that if we are only concerned about presence or absence of conditional
independences, then we need not consider such a model. The presence of the
three-way interaction $\psi_{235}$ means that the strength of association between $X_2$
and $X_3$ varies as a function of $X_5$. Its absence indicates that this is not so. ■

FIGURE 19.2. Graph for Example 19.10 (nodes $X_2$, $X_1$, $X_3$ drawn in a row).
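The rule used in Example 19.7 can be stated mechanically: in a graphical model, $\psi_A$ is present exactly when every pair of indices in $A$ is joined by an edge. The sketch below enumerates those index sets. The edge list is an assumption read off the two-way terms of the model in the example (the figure itself is not reproduced here), and `graphical_terms` is a hypothetical helper name.

```python
from itertools import combinations

def graphical_terms(nodes, edges):
    """Index sets A whose psi_A is nonzero: every pair in A must be an edge."""
    edge_set = {frozenset(e) for e in edges}
    terms = []
    for size in range(len(nodes) + 1):
        for subset in combinations(nodes, size):
            pairs = combinations(subset, 2)
            if all(frozenset(pair) in edge_set for pair in pairs):
                terms.append(subset)
    return terms

nodes = [1, 2, 3, 4, 5]
# Assumed edge set, taken from the two-way terms of the model in Example 19.7.
edges = [(1, 2), (2, 3), (2, 5), (3, 4), (3, 5), (4, 5)]
terms = graphical_terms(nodes, edges)
```

With this edge set the enumeration returns 14 index sets: the empty set, five singletons, six pairs, and the triples (2, 3, 5) and (3, 4, 5), matching the 14 terms of the graphical model in the example.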



19.3 Hierarchical Log-Linear Models 

There is a set of log-linear models that is larger than the set of graphical
models and that is widely used. These are the hierarchical log-linear
models.

19.8 Definition. A log-linear model is hierarchical if $\psi_A = 0$ and $A \subset B$
implies that $\psi_B = 0$.



19.9 Lemma. A graphical model is hierarchical but the reverse need not be
true.

19.10 Example. Let

$$\log f(x) = \psi_\emptyset(x) + \psi_1(x) + \psi_2(x) + \psi_3(x) + \psi_{12}(x) + \psi_{13}(x).$$

The model is hierarchical; its graph is given in Figure 19.2. The model is
also graphical because all terms involving (2,3) are omitted.



19.11 Example. Let

$$\log f(x) = \psi_\emptyset(x) + \psi_1(x) + \psi_2(x) + \psi_3(x) + \psi_{12}(x) + \psi_{13}(x) + \psi_{23}(x).$$

The model is hierarchical. It is not graphical. The graph corresponding to this
model is complete; see Figure 19.3. It is not graphical because $\psi_{123}(x) = 0$,
which does not correspond to any pairwise conditional independence. ■

FIGURE 19.3. The graph is complete. The model is hierarchical but not graphical.

FIGURE 19.4. The model for this graph is not hierarchical.

19.12 Example. Let

$$\log f(x) = \psi_\emptyset(x) + \psi_3(x) + \psi_{12}(x).$$

The corresponding graph is in Figure 19.4. This model is not hierarchical since
$\psi_1 = \psi_2 = 0$ but $\psi_{12}$ is not. Since it is not hierarchical, it is not graphical either. ■
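Definition 19.8 is easy to check by machine: a model is hierarchical when, for every included term, all of its sub-terms are included too. The following sketch applies that test to the models of Examples 19.10, 19.11 and 19.12, each encoded as a list of index tuples; the function name `is_hierarchical` and the encoding are illustrative assumptions.

```python
from itertools import combinations

def is_hierarchical(terms):
    """True if every subset of an included index set is also included."""
    included = {frozenset(t) for t in terms}
    for t in included:
        for size in range(len(t)):
            for sub in combinations(sorted(t), size):
                if frozenset(sub) not in included:
                    return False
    return True

# Each model is the list of index sets A with psi_A present; () is psi_empty.
ex_19_10 = [(), (1,), (2,), (3,), (1, 2), (1, 3)]
ex_19_11 = ex_19_10 + [(2, 3)]
ex_19_12 = [(), (3,), (1, 2)]

results = [is_hierarchical(m) for m in (ex_19_10, ex_19_11, ex_19_12)]
```

The check is the contrapositive of the definition: if $\psi_B \neq 0$ then no $\psi_A$ with $A \subset B$ may vanish.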



19.4 Model Generators 

Hierarchical models can be written succinctly using generators. This is most
easily explained by example. Suppose that $X = (X_1, X_2, X_3)$. Then, $M =
1.2 + 1.3$ stands for

$$\log f = \psi_\emptyset + \psi_1 + \psi_2 + \psi_3 + \psi_{12} + \psi_{13}.$$

The formula $M = 1.2 + 1.3$ says: “include $\psi_{12}$ and $\psi_{13}$.” We have to also include
the lower order terms or it won’t be hierarchical. The generator $M = 1.2.3$ is
the saturated model

$$\log f = \psi_\emptyset + \psi_1 + \psi_2 + \psi_3 + \psi_{12} + \psi_{13} + \psi_{23} + \psi_{123}.$$

The saturated model corresponds to fitting an unconstrained multinomial.
Consider $M = 1 + 2 + 3$ which means

$$\log f = \psi_\emptyset + \psi_1 + \psi_2 + \psi_3.$$

This is the mutual independence model. Finally, consider $M = 1.2$ which has
log-linear expansion

$$\log f = \psi_\emptyset + \psi_1 + \psi_2 + \psi_{12}.$$

This model makes $X_3 \mid X_2 = x_2, X_1 = x_1$ a uniform distribution.
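The generator notation amounts to "take every listed interaction and close it under subsets." A minimal sketch of that expansion follows; the parser and the name `expand_generator` are hypothetical and accept only the simple `a.b + c.d` syntax used above.

```python
from itertools import combinations

def expand_generator(formula):
    """Expand e.g. '1.2 + 1.3' into the set of index sets of all psi-terms."""
    terms = {frozenset()}                      # psi_empty is always present
    for piece in formula.split("+"):
        indices = [int(tok) for tok in piece.strip().split(".")]
        # Add the interaction itself and every lower-order term it implies.
        for size in range(1, len(indices) + 1):
            for sub in combinations(indices, size):
                terms.add(frozenset(sub))
    return terms

m1 = expand_generator("1.2 + 1.3")
saturated = expand_generator("1.2.3")
independence = expand_generator("1 + 2 + 3")
```

For three variables this gives 6 terms for `1.2 + 1.3`, 8 for the saturated model `1.2.3`, and 4 for the mutual independence model `1 + 2 + 3`, in agreement with the expansions written out above.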



19.5 Fitting Log-Linear Models to Data 

Let $\beta$ denote all the parameters in a log-linear model $M$. The log-likelihood
for $\beta$ is

$$\ell(\beta) = \sum_{i=1}^n \log f(X_i; \beta)$$

where $f(X_i; \beta)$ is the probability function for the $i$th random vector $X_i =
(X_{i1}, \ldots, X_{im})$ as given by equation (19.1). The MLE $\hat\beta$ generally has to be
found numerically. The Fisher information matrix is also found numerically
and we can then get the estimated standard errors from the inverse Fisher
information matrix.

When fitting log-linear models, one has to address the following model
selection problem: which $\psi$ terms should we include in the model? This is
essentially the same as the model selection problem in linear regression.

One approach is to use AIC. Let $M$ denote some log-linear model. Different
models correspond to setting different $\psi$ terms to 0. Now we choose the
model $M$ which maximizes

$$\text{AIC}(M) = \hat\ell(M) - |M| \qquad (19.2)$$

where $|M|$ is the number of parameters in model $M$ and $\hat\ell(M)$ is the value
of the log-likelihood evaluated at the MLE for that model. Usually the model
search is restricted to hierarchical models. This reduces the search space. Some
also claim that we should only search through the hierarchical models because
other models are less interpretable.

A different approach is based on hypothesis testing. The model that includes
all possible $\psi$-terms is called the saturated model and we denote it by $M_{\text{sat}}$.
Now for each $M$ we test the hypothesis

$H_0$ : the true model is $M$ versus $H_1$ : the true model is $M_{\text{sat}}$.

The likelihood ratio test for this hypothesis is called the deviance.

19.13 Definition. For any submodel $M$, define the deviance dev($M$) by

$$\text{dev}(M) = 2(\hat\ell_{\text{sat}} - \hat\ell_M)$$

where $\hat\ell_{\text{sat}}$ is the log-likelihood of the saturated model evaluated at the MLE
and $\hat\ell_M$ is the log-likelihood of the model $M$ evaluated at its MLE.

19.14 Theorem. The deviance is the likelihood ratio test statistic for

$H_0$ : the model is $M$ versus $H_1$ : the model is $M_{\text{sat}}$.

Under $H_0$, $\text{dev}(M) \rightsquigarrow \chi^2_\nu$ with $\nu$ degrees of freedom equal to the difference in
the number of parameters between the saturated model and $M$.

One way to find a good model is to use the deviance to test every sub-model.
Every model that is not rejected by this test is then considered a plausible
model. However, this is not a good strategy for two reasons. First, we will end
up doing many tests, which means that there is ample opportunity for making
Type I and Type II errors. Second, we will end up using models where we
failed to reject $H_0$. But we might fail to reject $H_0$ due to low power. The
result is that we end up with a bad model just due to low power.

After finding a “best model” this way we can draw the corresponding graph. 

19.15 Example. The following breast cancer data are from Morrison et al.
(1973). The data are on diagnostic center ($X_1$), nuclear grade ($X_2$), and
survival ($X_3$):

                   x_2:   malignant   malignant   benign   benign
                   x_3:   died        survived    died     survived
    x_1  Boston           35          59          47       112
         Glamorgan        42          77          26       76

The saturated log-linear model is:

    Variable                       $\hat\beta_j$    se      $W_j$     p-value
    (Intercept)                     3.56    0.17    21.03    0.00 ***
    center                          0.18    0.22     0.79    0.42
    grade                           0.29    0.22     1.32    0.18
    survival                        0.52    0.21     2.44    0.01 *
    center × grade                 -0.77    0.33    -2.31    0.02 *
    center × survival               0.08    0.28     0.29    0.76
    grade × survival                0.34    0.27     1.25    0.20
    center × grade × survival       0.12    0.40     0.29    0.76

FIGURE 19.5. The graph for Example 19.15 (nodes: Center, Grade, Survival).


The best sub-model, selected using AIC and backward searching, is:

    Variable               $\hat\beta_j$    se      $W_j$     p-value
    (Intercept)             3.52    0.13    25.62    < 0.00 ***
    center                  0.23    0.13     1.70    0.08
    grade                   0.26    0.18     1.43    0.15
    survival                0.56    0.14     3.98    6.65e-05 ***
    center × grade         -0.67    0.18    -3.62    0.00
    grade × survival        0.37    0.19     1.90    0.05


The graph for this model $M$ is shown in Figure 19.5. To test the fit of this
model, we compute the deviance of $M$ which is 0.6. The appropriate $\chi^2$ has
8 − 6 = 2 degrees of freedom. The p-value is $\mathbb{P}(\chi^2_2 > .6) = .74$. So we have no
evidence to suggest that the model is a poor fit. ■
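The deviance quoted in Example 19.15 can be reproduced by hand. The sketch below makes two assumptions that are not stated in the text: it uses the standard closed-form fitted counts for the model with generators center.grade + grade.survival, namely $n_{cg+}\, n_{+gs} / n_{+g+}$, rather than a numerical optimizer, and it writes the deviance as $2 \sum \text{observed} \cdot \log(\text{observed}/\text{fitted})$. The variable names are illustrative.

```python
import math

# counts[center][grade][survival], taken from the table in Example 19.15.
# center: 0 = Boston, 1 = Glamorgan; grade: 0 = malignant, 1 = benign;
# survival: 0 = died, 1 = survived.
counts = [[[35, 59], [47, 112]],
          [[42, 77], [26, 76]]]

def fitted(c, g, s):
    """Fitted count under center.grade + grade.survival (assumed closed form)."""
    n_cg = sum(counts[c][g])                          # center-by-grade margin
    n_gs = counts[0][g][s] + counts[1][g][s]          # grade-by-survival margin
    n_g = sum(sum(counts[k][g]) for k in range(2))    # grade margin
    return n_cg * n_gs / n_g

deviance = 2 * sum(
    counts[c][g][s] * math.log(counts[c][g][s] / fitted(c, g, s))
    for c in range(2) for g in range(2) for s in range(2)
)

# Upper tail of a chi-squared distribution; this closed form holds only for 2 df.
p_value = math.exp(-deviance / 2)
```

Under these assumptions the computed deviance is close to 0.6 and the tail probability is close to .74, in line with the values reported in the example.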



19.6 Bibliographic Remarks 



For this chapter, I drew heavily on Whittaker (1990) which is an excellent 
text on log-linear models and graphical models. Some of the exercises are from 
Whittaker. A classic reference on log-linear models is Bishop et al. (1975).

19.7 Exercises

1. Solve for the $p_{ij}$'s in terms of the $\beta$'s in Example 19.3.

2. Prove Lemma 19.5. 

3. Prove Lemma 19.9. 

4. Consider random variables $(X_1, X_2, X_3, X_4)$. Suppose the log-density is

$$\log f(x) = \psi_\emptyset(x) + \psi_{12}(x) + \psi_{13}(x) + \psi_{24}(x) + \psi_{34}(x).$$

(a) Draw the graph G for these variables. 

(b) Write down all independence and conditional independence relations 
implied by the graph. 

(c) Is this model graphical? Is it hierarchical? 

5. Suppose that parameters $p(x_1, x_2, x_3)$ are proportional to the following
values:

             x_2:    0      0      1      1
             x_3:    0      1      0      1
    x_1 = 0          2      8      4      16
    x_1 = 1          16     128    32     256

Find the $\psi$-terms for the log-linear expansion. Comment on the model.

6. Let $X_1, \ldots, X_4$ be binary. Draw the independence graphs corresponding
to the following log-linear models. Also, identify whether each is
graphical and/or hierarchical (or neither).

(a) $\log f = 7 + 11x_1 + 2x_2 + 1.6x_3 + 17x_4$

(b) $\log f = 7 + 11x_1 + 2x_2 + 1.6x_3 + 17x_4 + 12x_2x_3 + 78x_2x_4 + 3x_3x_4 + 32x_2x_3x_4$

(c) $\log f = 7 + 11x_1 + 2x_2 + 1.5x_3 + 17x_4 + 12x_2x_3 + 3x_3x_4 + x_1x_4 + 2x_1x_2$

(d) $\log f = 7 + 5055\,x_1x_2x_3x_4$




20

Nonparametric Curve Estimation



In this chapter we discuss nonparametric estimation of probability density
functions and regression functions, which we refer to as curve estimation or
smoothing.

In Chapter 7 we saw that it is possible to consistently estimate a cumulative
distribution function $F$ without making any assumptions about $F$. If we want
to estimate a probability density function $f(x)$ or a regression function $r(x) =
\mathbb{E}(Y \mid X = x)$ the situation is different. We cannot estimate these functions
consistently without making some smoothness assumptions. Correspondingly,
we need to perform some sort of smoothing operation on the data.

An example of a density estimator is a histogram, which we discuss in
detail in Section 20.2. To form a histogram estimator of a density $f$, we divide
the real line into disjoint sets called bins. The histogram estimator is a piecewise
constant function where the height of the function is proportional to the number
of observations in each bin; see Figure 20.3. The number of bins is an example
of a smoothing parameter. If we smooth too much (large bins) we get a
highly biased estimator while if we smooth too little (small bins) we get a
highly variable estimator. Much of curve estimation is concerned with trying
to optimally balance variance and bias.

FIGURE 20.1. A curve estimate $\hat g$ is random because it is a function of the data.
The point $x$ at which we evaluate $\hat g$ is not a random variable.

20.1 The Bias-Variance Tradeoff

Let $g$ denote an unknown function such as a density function or a regression
function. Let $\hat g_n$ denote an estimator of $g$. Bear in mind that $\hat g_n(x)$ is a random
function evaluated at a point $x$. The estimator is random because it depends
on the data. See Figure 20.1.

As a loss function, we will use the integrated squared error (ISE):¹

$$L(g, \hat g_n) = \int (g(u) - \hat g_n(u))^2 \, du. \qquad (20.1)$$

The risk or mean integrated squared error (MISE) with respect to
squared error loss is

$$R(g, \hat g_n) = \mathbb{E}\big(L(g, \hat g_n)\big). \qquad (20.2)$$

20.1 Lemma. The risk can be written as

$$R(g, \hat g_n) = \int b^2(x)\,dx + \int v(x)\,dx \qquad (20.3)$$

where

$$b(x) = \mathbb{E}(\hat g_n(x)) - g(x) \qquad (20.4)$$

is the bias of $\hat g_n(x)$ at a fixed $x$ and

$$v(x) = \mathbb{V}(\hat g_n(x)) = \mathbb{E}\Big( \big(\hat g_n(x) - \mathbb{E}(\hat g_n(x))\big)^2 \Big) \qquad (20.5)$$

is the variance of $\hat g_n(x)$ at a fixed $x$.

¹We could use other loss functions. The results are similar but the analysis is much more
complicated.

FIGURE 20.2. The Bias-Variance trade-off. The bias increases and the variance
decreases with the amount of smoothing. The optimal amount of smoothing, indicated
by the vertical line, minimizes the risk = bias² + variance.

In summary,

RISK = BIAS² + VARIANCE. (20.6)

When the data are oversmoothed, the bias term is large and the variance
is small. When the data are undersmoothed the opposite is true; see Figure
20.2. This is called the bias-variance tradeoff. Minimizing risk corresponds
to balancing bias and variance.
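The decomposition in Lemma 20.1 follows from adding and subtracting $\mathbb{E}(\hat g_n(x))$ inside the square. A short derivation, given here as a sketch that assumes expectation and integration can be interchanged:

```latex
\begin{align*}
\mathbb{E}\big(\hat g_n(x) - g(x)\big)^2
  &= \mathbb{E}\Big(\big[\hat g_n(x) - \mathbb{E}\hat g_n(x)\big]
       + \big[\mathbb{E}\hat g_n(x) - g(x)\big]\Big)^2 \\
  &= v(x) + b^2(x)
     + 2\, b(x)\, \mathbb{E}\big(\hat g_n(x) - \mathbb{E}\hat g_n(x)\big) \\
  &= v(x) + b^2(x), \\[4pt]
R(g, \hat g_n)
  &= \mathbb{E} \int \big(g(x) - \hat g_n(x)\big)^2 dx
   = \int \mathbb{E}\big(\hat g_n(x) - g(x)\big)^2 dx
   = \int b^2(x)\,dx + \int v(x)\,dx .
\end{align*}
```

The cross term vanishes because $b(x)$ is not random and $\hat g_n(x) - \mathbb{E}\hat g_n(x)$ has mean zero.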



20.2 Histograms 

Let $X_1, \ldots, X_n$ be IID on [0, 1] with density $f$. The restriction to [0, 1] is not
crucial; we can always rescale the data to be on this interval. Let $m$ be an
integer and define bins

$$B_1 = \left[0, \frac{1}{m}\right), \quad B_2 = \left[\frac{1}{m}, \frac{2}{m}\right), \quad \ldots, \quad B_m = \left[\frac{m-1}{m}, 1\right]. \qquad (20.7)$$

Define the binwidth $h = 1/m$, let $\nu_j$ be the number of observations in $B_j$,
let $\hat p_j = \nu_j / n$ and let $p_j = \int_{B_j} f(u)\,du$.

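The histogram estimator is defined by

$$\hat f_n(x) = \begin{cases} \hat p_1 / h & x \in B_1 \\ \hat p_2 / h & x \in B_2 \\ \ \ \vdots & \ \ \vdots \\ \hat p_m / h & x \in B_m \end{cases}$$

which we can write more succinctly as

$$\hat f_n(x) = \sum_{j=1}^m \frac{\hat p_j}{h} I(x \in B_j). \qquad (20.8)$$

To understand the motivation for this estimator, let $p_j = \int_{B_j} f(u)\,du$ and note
that, for $x \in B_j$ and $h$ small,

$$\mathbb{E}(\hat f_n(x)) = \frac{\mathbb{E}(\hat p_j)}{h} = \frac{p_j}{h} = \frac{\int_{B_j} f(u)\,du}{h} \approx \frac{f(x)\,h}{h} = f(x).$$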
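The histogram estimator (20.8) takes only a few lines to compute. The sketch below assumes data already rescaled to [0, 1], and places an observation equal to 1 in the last bin; the function names and the sample are made up for illustration.

```python
def histogram_density(data, m):
    """Return (heights, h), where heights[j] = p_hat_j / h for m equal bins on [0, 1]."""
    n = len(data)
    h = 1.0 / m
    counts = [0] * m
    for x in data:
        j = min(int(x * m), m - 1)    # an observation at x = 1 goes in the last bin
        counts[j] += 1
    return [c / (n * h) for c in counts], h

def evaluate(heights, x):
    """Value of the histogram estimate at a point x in [0, 1]."""
    m = len(heights)
    return heights[min(int(x * m), m - 1)]

data = [0.05, 0.12, 0.31, 0.33, 0.35, 0.62, 0.70, 0.98]   # made-up sample
heights, h = histogram_density(data, m=4)
total_area = sum(height * h for height in heights)
```

Because the heights are $\hat p_j / h$ and the $\hat p_j$ sum to one, the estimate integrates to one for any choice of $m$.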



20.2 Example. Figure 20.3 shows three different histograms based on $n =
1{,}266$ data points from an astronomical sky survey. Each data point represents
the distance from us to a galaxy. The galaxies lie on a “pencilbeam”
pointing directly from the Earth out into space. Because of the finite speed of
light, looking at galaxies farther and farther away corresponds to looking back
in time. Choosing the right number of bins involves finding a good tradeoff
between bias and variance. We shall see later that the top left histogram has
too few bins, resulting in oversmoothing and too much bias. The bottom left
histogram has too many bins, resulting in undersmoothing and too much variance.
The top right histogram is just right. The histogram reveals the presence of
clusters of galaxies. Seeing how the size and number of galaxy clusters varies
with time helps cosmologists understand the evolution of the universe. ■



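The mean and variance of $\hat f_n(x)$ are given in the following theorem.

20.3 Theorem. Consider fixed $x$ and fixed $m$, and let $B_j$ be the bin containing
$x$. Then,

$$\mathbb{E}(\hat f_n(x)) = \frac{p_j}{h} \quad \text{and} \quad \mathbb{V}(\hat f_n(x)) = \frac{p_j(1 - p_j)}{n h^2}. \qquad (20.9)$$

Let’s take a closer look at the bias-variance tradeoff using equation (20.9).
Consider some $x \in B_j$. For any other $u \in B_j$,

$$f(u) \approx f(x) + (u - x) f'(x)$$

and so

$$p_j = \int_{B_j} f(u)\,du \approx \int_{B_j} \Big( f(x) + (u - x) f'(x) \Big) du = f(x)\,h + h f'(x) \left( h\left(j - \tfrac{1}{2}\right) - x \right).$$

Therefore, the bias $b(x)$ is

$$b(x) = \mathbb{E}(\hat f_n(x)) - f(x) = \frac{p_j}{h} - f(x) \approx \frac{f(x)\,h + h f'(x)\left( h\left(j - \tfrac{1}{2}\right) - x \right)}{h} - f(x) = f'(x) \left( h\left(j - \tfrac{1}{2}\right) - x \right).$$

If $\tilde x_j$ is the center of the bin, then

$$\int_{B_j} b^2(x)\,dx \approx \int_{B_j} (f'(x))^2 \left( h\left(j - \tfrac{1}{2}\right) - x \right)^2 dx \approx (f'(\tilde x_j))^2 \int_{B_j} \left( h\left(j - \tfrac{1}{2}\right) - x \right)^2 dx = (f'(\tilde x_j))^2 \, \frac{h^3}{12}.$$

Therefore,

$$\int_0^1 b^2(x)\,dx = \sum_{j=1}^m \int_{B_j} b^2(x)\,dx \approx \sum_{j=1}^m (f'(\tilde x_j))^2 \, \frac{h^3}{12} = \frac{h^2}{12} \sum_{j=1}^m h\,(f'(\tilde x_j))^2 \approx \frac{h^2}{12} \int_0^1 (f'(x))^2 \, dx.$$

Note that this increases as a function of $h$. Now consider the variance. For $h$
small, $1 - p_j \approx 1$, so

$$v(x) \approx \frac{p_j}{n h^2} = \frac{f(x)\,h + h f'(x)\left( h\left(j - \tfrac{1}{2}\right) - x \right)}{n h^2} \approx \frac{f(x)}{n h}.$$

Since $f$ integrates to one, $\int_0^1 v(x)\,dx \approx 1/(nh)$, which decreases as a function of $h$.

20.4 Theorem. Suppose that $\int (f'(u))^2 du < \infty$. Then

$$R(\hat f_n, f) \approx \frac{h^2}{12} \int (f'(u))^2 du + \frac{1}{nh}. \qquad (20.10)$$

The value $h^*$ that minimizes (20.10) is

$$h^* = \frac{1}{n^{1/3}} \left( \frac{6}{\int (f'(u))^2 du} \right)^{1/3}. \qquad (20.11)$$

With this choice of binwidth,

$$R(\hat f_n, f) \approx \frac{C}{n^{2/3}} \qquad (20.12)$$

where $C = (3/4)^{2/3} \left( \int (f'(u))^2 du \right)^{1/3}$.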



Theorem 20.4 is quite revealing. We see that with an optimally chosen
binwidth, the MISE decreases to 0 at rate $n^{-2/3}$. By comparison, most parametric
estimators converge at rate $n^{-1}$. The slower rate of convergence is the price
we pay for being nonparametric. The formula for the optimal binwidth $h^*$ is
of theoretical interest but it is not useful in practice since it depends on the
unknown function $f$.
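Although $h^*$ cannot be used when $f$ is unknown, the formulas of Theorem 20.4 are easy to check numerically when it is known. The sketch below is illustrative only: it takes the density $f(x) = 2x$ on [0, 1], for which $\int (f'(u))^2 du = 4$, and the names `approx_risk`, `optimal_binwidth` and `roughness` are hypothetical.

```python
def approx_risk(h, n, roughness):
    """Approximate risk h^2/12 * roughness + 1/(n h); roughness = integral of f'(u)^2."""
    return h ** 2 / 12 * roughness + 1 / (n * h)

def optimal_binwidth(n, roughness):
    """Binwidth minimizing the approximate risk."""
    return (6 / roughness) ** (1 / 3) / n ** (1 / 3)

n = 1000
roughness = 4.0                      # integral of f'(u)^2 for f(x) = 2x on [0, 1]
h_star = optimal_binwidth(n, roughness)
risk_star = approx_risk(h_star, n, roughness)

# Constant C in the rate C / n^(2/3).
C = (3 / 4) ** (2 / 3) * roughness ** (1 / 3)
```

The value `risk_star` agrees with `C / n ** (2 / 3)`, and moving `h` away from `h_star` in either direction increases `approx_risk`.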

A practical way to choose the binwidth is to estimate the risk function
and minimize over $h$. Recall that the loss function, which we now write as a
function of $h$, is

$$L(h) = \int \big(\hat f_n(x) - f(x)\big)^2 dx = \int \hat f_n^2(x)\,dx - 2 \int \hat f_n(x) f(x)\,dx + \int f^2(x)\,dx.$$

The last term does not depend on the binwidth $h$ so minimizing the risk is
equivalent to minimizing the expected value of

$$J(h) = \int \hat f_n^2(x)\,dx - 2 \int \hat f_n(x) f(x)\,dx.$$

We shall refer to $\mathbb{E}(J(h))$ as the risk, although it differs from the true risk by
the constant term $\int f^2(x)\,dx$.



20.5 Definition. The cross-validation estimator of risk is

$$\hat J(h) = \int \big(\hat f_n(x)\big)^2 dx - \frac{2}{n} \sum_{i=1}^n \hat f_{(-i)}(X_i) \qquad (20.13)$$

where $\hat f_{(-i)}$ is the histogram estimator obtained after removing the $i$th
observation. We refer to $\hat J(h)$ as the cross-validation score or estimated
risk.

20.6 Theorem. The cross-validation estimator is nearly unbiased:

$$\mathbb{E}(\hat J(h)) \approx \mathbb{E}(J(h)).$$

In principle, we need to recompute the histogram $n$ times to compute $\hat J(h)$.
Moreover, this has to be done for all values of $h$. Fortunately, there is a
shortcut formula.



20.7 Theorem. The following identity holds:

$$\hat J(h) = \frac{2}{(n-1)h} - \frac{n+1}{(n-1)h} \sum_{j=1}^m \hat p_j^{\,2}. \qquad (20.14)$$
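The identity in Theorem 20.7 can be verified against the definition of the cross-validation score by refitting the histogram $n$ times. The sketch below does both for data on [0, 1]. The shortcut is written with $(n-1)h$ in both denominators, which is what a direct calculation from Definition 20.5 gives. All function names and the sample are illustrative.

```python
def bin_index(x, m):
    return min(int(x * m), m - 1)

def histogram_heights(data, m):
    """Heights p_hat_j / h of the histogram with m equal bins on [0, 1]."""
    n, h = len(data), 1.0 / m
    counts = [0] * m
    for x in data:
        counts[bin_index(x, m)] += 1
    return [c / (n * h) for c in counts]

def cv_shortcut(data, m):
    """Cross-validation score computed from the bin proportions only."""
    n, h = len(data), 1.0 / m
    sum_sq = sum((height * h) ** 2 for height in histogram_heights(data, m))
    return 2 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * sum_sq

def cv_brute_force(data, m):
    """Definition 20.5 computed directly, refitting the histogram n times."""
    n, h = len(data), 1.0 / m
    integral_sq = sum(v ** 2 * h for v in histogram_heights(data, m))
    loo = 0.0
    for i, x in enumerate(data):
        rest = data[:i] + data[i + 1:]             # leave out observation i
        loo += histogram_heights(rest, m)[bin_index(x, m)]
    return integral_sq - 2 / n * loo

data = [0.05, 0.12, 0.31, 0.33, 0.35, 0.62, 0.70, 0.98, 0.51, 0.26]   # made-up sample
gaps = [abs(cv_shortcut(data, m) - cv_brute_force(data, m)) for m in (2, 4, 8)]
```

In practice one would evaluate `cv_shortcut` over a range of `m` and keep the minimizer; the brute-force version is included only as a check.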



20.8 Example. We used cross-validation in the astronomy example. The
cross-validation function is quite flat near its minimum. Any $m$ in the range of 73 to
310 is an approximate minimizer but the resulting histogram does not change
much over this range. The histogram in the top right plot in Figure 20.3 was
constructed using $m = 73$ bins. The bottom right plot shows the estimated
risk, or more precisely, $\hat J$, plotted versus the number of bins. ■



Next we want a confidence set for $f$. Suppose $\hat f_n$ is a histogram with $m$ bins
and binwidth $h = 1/m$. We cannot realistically make confidence statements
about the fine details of the true density $f$. Instead, we shall make confidence
statements about $f$ at the resolution of the histogram. To this end, define

$$\bar f_n(x) = \mathbb{E}(\hat f_n(x)) = \frac{p_j}{h} \quad \text{for } x \in B_j \qquad (20.15)$$

where $p_j = \int_{B_j} f(u)\,du$. Think of $\bar f_n(x)$ as a “histogramized” version of $f$.

20.9 Definition. A pair of functions $(\ell_n(x), u_n(x))$ is a $1 - \alpha$ confidence
band (or confidence envelope) if

$$\mathbb{P}\Big( \ell_n(x) \le \bar f_n(x) \le u_n(x) \ \text{ for all } x \Big) \ge 1 - \alpha. \qquad (20.16)$$



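20.10 Theorem. Let $m = m(n)$ be the number of bins in the histogram $\hat f_n$.
Assume that $m(n) \to \infty$ and $m(n) \log n / n \to 0$ as $n \to \infty$. Define

$$\ell_n(x) = \left( \max\left\{ \sqrt{\hat f_n(x)} - c, \ 0 \right\} \right)^2, \qquad u_n(x) = \left( \sqrt{\hat f_n(x)} + c \right)^2 \qquad (20.17)$$

where

$$c = \frac{z_{\alpha/(2m)}}{2} \sqrt{\frac{m}{n}}. \qquad (20.18)$$

Then, $(\ell_n(x), u_n(x))$ is an approximate $1 - \alpha$ confidence band.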



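Proof. Here is an outline of the proof. From the central limit theorem, $\hat p_j \approx
N(p_j, p_j(1 - p_j)/n)$. By the delta method, $\sqrt{\hat p_j} \approx N(\sqrt{p_j}, 1/(4n))$. Moreover,
it can be shown that the $\sqrt{\hat p_j}$'s are approximately independent. Therefore,

$$2\sqrt{n}\left( \sqrt{\hat p_j} - \sqrt{p_j} \right) \approx Z_j \qquad (20.19)$$

where $Z_1, \ldots, Z_m \sim N(0, 1)$. Let

$$A = \Big\{ \ell_n(x) \le \bar f_n(x) \le u_n(x) \ \text{for all } x \Big\} = \Big\{ \max_x \Big| \sqrt{\hat f_n(x)} - \sqrt{\bar f_n(x)} \Big| \le c \Big\}.$$

Then,

$$\mathbb{P}(A^c) = \mathbb{P}\Big( \max_x \Big| \sqrt{\hat f_n(x)} - \sqrt{\bar f_n(x)} \Big| > c \Big) = \mathbb{P}\Big( \max_j \Big| \sqrt{\tfrac{\hat p_j}{h}} - \sqrt{\tfrac{p_j}{h}} \Big| > c \Big)$$

$$= \mathbb{P}\Big( \max_j 2\sqrt{n}\, \big| \sqrt{\hat p_j} - \sqrt{p_j} \big| > z_{\alpha/(2m)} \Big) \le \sum_{j=1}^m \mathbb{P}\Big( 2\sqrt{n}\, \big| \sqrt{\hat p_j} - \sqrt{p_j} \big| > z_{\alpha/(2m)} \Big)$$

$$\approx \sum_{j=1}^m \mathbb{P}\big( |Z_j| > z_{\alpha/(2m)} \big) = \sum_{j=1}^m \frac{\alpha}{m} = \alpha. \ ■$$

FIGURE 20.4. 95 percent confidence envelope for astronomy data using $m = 73$
bins.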

20.11 Example. Figure 20.4 shows a 95 percent confidence envelope for the 
astronomy data. We see that even with over 1,000 data points, there is still 
substantial uncertainty. ■ 
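The envelope of Theorem 20.10 is simple to compute once the histogram heights are known. A minimal sketch follows, using the standard library's `statistics.NormalDist` for the Normal quantile; the heights and sample size are made-up values, and `histogram_band` is a hypothetical name.

```python
import math
from statistics import NormalDist

def histogram_band(heights, n, alpha=0.05):
    """Lower and upper envelopes for a histogram with len(heights) bins and n points."""
    m = len(heights)
    z = NormalDist().inv_cdf(1 - alpha / (2 * m))    # upper alpha/(2m) Normal quantile
    c = z / 2 * math.sqrt(m / n)
    lower = [max(math.sqrt(f) - c, 0.0) ** 2 for f in heights]
    upper = [(math.sqrt(f) + c) ** 2 for f in heights]
    return lower, upper

heights = [1.0, 1.5, 1.0, 0.5]      # hypothetical histogram heights, m = 4 bins
lower, upper = histogram_band(heights, n=200)
```

Working on the square-root scale is what makes the half-width `c` the same in every bin, since the delta method gives $\sqrt{\hat p_j}$ a variance that does not depend on $p_j$.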



20.3 Kernel Density Estimation 



Histograms are discontinuous. Kernel density estimators are smoother and
they converge faster to the true density than histograms.

Let $X_1, \ldots, X_n$ denote the observed data, a sample from $f$. In this chapter,
a kernel is defined to be any smooth function $K$ such that $K(x) \ge 0$,
$\int K(x)\,dx = 1$, $\int x K(x)\,dx = 0$ and $\sigma_K^2 \equiv \int x^2 K(x)\,dx > 0$. Two examples of
kernels are the Epanechnikov kernel

$$K(x) = \begin{cases} \frac{3}{4}\left(1 - x^2/5\right)/\sqrt{5} & |x| < \sqrt{5} \\ 0 & \text{otherwise} \end{cases} \qquad (20.20)$$

and the Gaussian (Normal) kernel $K(x) = (2\pi)^{-1/2} e^{-x^2/2}$.

20.12 Definition. Given a kernel $K$ and a positive number $h$, called the
bandwidth, the kernel density estimator is defined to be

$$\hat f_n(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K\left( \frac{x - X_i}{h} \right). \qquad (20.21)$$

An example of a kernel density estimator is shown in Figure 20.5. The kernel
estimator effectively puts a smoothed-out lump of mass of size $1/n$ over each
data point $X_i$. The bandwidth $h$ controls the amount of smoothing. When $h$
is close to 0, $\hat f_n$ consists of a set of spikes, one at each data point. The height
of the spikes tends to infinity as $h \to 0$. When $h \to \infty$, $\hat f_n$ tends to a uniform
density.

20.13 Example. Figure 20.6 shows kernel density estimators for the astronomy
data using three different bandwidths. In each case we used a Gaussian
kernel. The properly smoothed kernel density estimator in the top right panel
shows similar structure to the histogram. However, it is easier to see the
clusters with the kernel estimator. ■
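A kernel density estimator is an average of rescaled kernels centred at the data points, which translates directly into code. The sketch below implements the Gaussian and Epanechnikov kernels described above and checks numerically that the estimate integrates to one; the sample, bandwidth and grid are arbitrary illustrative choices.

```python
import math

def gaussian_kernel(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def epanechnikov_kernel(x):
    if abs(x) < math.sqrt(5):
        return 0.75 * (1 - x * x / 5) / math.sqrt(5)
    return 0.0

def kde(x, data, h, kernel=gaussian_kernel):
    """Kernel density estimate at x: average of kernels of width h centred at the data."""
    return sum(kernel((x - xi) / h) for xi in data) / (len(data) * h)

data = [0.1, 0.4, 0.45, 0.8]                       # made-up sample
step = 0.01
grid = [-5 + step * k for k in range(1101)]        # grid from -5 to 6

# Riemann-sum approximations to the integral of the estimate.
area_gauss = sum(kde(x, data, 0.3) for x in grid) * step
area_epan = sum(kde(x, data, 0.3, epanechnikov_kernel) for x in grid) * step
```

Evaluating `kde` with a very small `h` produces tall spikes at the data points, and a very large `h` produces a nearly flat curve, which is the behaviour described in the text.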



To construct a kernel density estimator, we need to choose a kernel $K$ and
a bandwidth $h$. It can be shown theoretically and empirically that the choice
of $K$ is not crucial.² However, the choice of bandwidth $h$ is very important.
As with the histogram, we can make a theoretical statement about how the
risk of the estimator depends on the bandwidth.



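20.14 Theorem. Under weak assumptions on $f$ and $K$,

$$R(f, \hat f_n) \approx \frac{1}{4} \sigma_K^4 h^4 \int (f''(x))^2 dx + \frac{\int K^2(x)\,dx}{nh} \qquad (20.22)$$

where $\sigma_K^2 = \int x^2 K(x)\,dx$. The optimal bandwidth is

$$h^* = \frac{c_1^{-2/5}\, c_2^{1/5}\, c_3^{-1/5}}{n^{1/5}} \qquad (20.23)$$

where $c_1 = \int x^2 K(x)\,dx$, $c_2 = \int K(x)^2 dx$ and $c_3 = \int (f''(x))^2 dx$. With this
choice of bandwidth,

$$R(f, \hat f_n) \approx \frac{c_4}{n^{4/5}}$$

for some constant $c_4 > 0$.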



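Proof. Write $K_h(x, X) = h^{-1} K((x - X)/h)$ and $\hat f_n(x) = n^{-1} \sum_i K_h(x, X_i)$.
Thus, $\mathbb{E}[\hat f_n(x)] = \mathbb{E}[K_h(x, X)]$ and $\mathbb{V}[\hat f_n(x)] = n^{-1} \mathbb{V}[K_h(x, X)]$. Now,

$$\mathbb{E}[K_h(x, X)] = \int \frac{1}{h} K\left( \frac{x - t}{h} \right) f(t)\,dt$$

$$= \int K(u) f(x - hu)\,du$$

$$= \int K(u) \left[ f(x) - hu f'(x) + \frac{h^2 u^2}{2} f''(x) + \cdots \right] du$$

$$= f(x) + \frac{1}{2} h^2 f''(x) \int u^2 K(u)\,du + \cdots$$

since $\int K(x)\,dx = 1$ and $\int x K(x)\,dx = 0$. The bias is

$$\mathbb{E}[K_h(x, X)] - f(x) \approx \frac{1}{2} \sigma_K^2 h^2 f''(x).$$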



By a similar calculation,

$$\mathbb{V}[\hat f_n(x)] \approx \frac{f(x) \int K^2(x)\,dx}{nh}.$$

The result follows from integrating the squared bias plus the variance. ■

We see that kernel estimators converge at rate $n^{-4/5}$ while histograms
converge at the slower rate $n^{-2/3}$. It can be shown that, under weak assumptions,
there does not exist a nonparametric estimator that converges faster than
$n^{-4/5}$.

²It can be shown that the Epanechnikov kernel is optimal in the sense of giving smallest
asymptotic mean squared error, but it is really the choice of bandwidth which is crucial.

FIGURE 20.6. Kernel density estimators and estimated risk for the astronomy data.
Top left: oversmoothed. Top right: just right (bandwidth chosen by cross-validation).
Bottom left: undersmoothed. Bottom right: cross-validation curve as a function of
bandwidth $h$. The bandwidth was chosen to be the value of $h$ where the curve is a
minimum.

The expression for $h^*$ depends on the unknown density $f$ which makes
the result of little practical use. As with the histograms, we shall use
cross-validation to find a bandwidth. Thus, we estimate the risk (up to a constant)
by

$$\hat J(h) = \int \hat f^2(x)\,dx - \frac{2}{n} \sum_{i=1}^n \hat f_{-i}(X_i) \qquad (20.24)$$

where $\hat f_{-i}$ is the kernel density estimator after omitting the $i$th observation.



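20.15 Theorem. For any $h > 0$,

$$\mathbb{E}\big[\hat J(h)\big] = \mathbb{E}[J(h)].$$

Also,

$$\hat J(h) \approx \frac{1}{h n^2} \sum_i \sum_j K^*\left( \frac{X_i - X_j}{h} \right) + \frac{2}{nh} K(0) \qquad (20.25)$$

where $K^*(x) = K^{(2)}(x) - 2K(x)$ and $K^{(2)}(z) = \int K(z - y) K(y)\,dy$. In
particular, if $K$ is a $N(0,1)$ Gaussian kernel then $K^{(2)}(z)$ is the $N(0, 2)$ density.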
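For the Gaussian kernel the cross-validation score can be computed from pairwise differences alone, because the convolution $K^{(2)}$ of two $N(0,1)$ densities is the $N(0,2)$ density. The sketch below assumes the approximation $\hat J(h) \approx \frac{1}{hn^2} \sum_i \sum_j K^*\big((X_i - X_j)/h\big) + \frac{2}{nh} K(0)$ with $K^* = K^{(2)} - 2K$, and then picks the bandwidth on a grid. The data, the grid and the function names are made up for illustration.

```python
import math

def normal_pdf(z, var):
    """Density of a N(0, var) random variable at z."""
    return math.exp(-z * z / (2 * var)) / math.sqrt(2 * math.pi * var)

def cv_score(data, h):
    """Approximate cross-validation score for a Gaussian-kernel density estimate."""
    n = len(data)
    total = 0.0
    for xi in data:
        for xj in data:
            z = (xi - xj) / h
            total += normal_pdf(z, 2.0) - 2 * normal_pdf(z, 1.0)   # K*(z)
    return total / (h * n * n) + 2 * normal_pdf(0.0, 1.0) / (n * h)

data = [0.1, 0.4, 0.45, 0.8, 1.3, 1.35, 2.0]       # made-up sample
bandwidths = [0.05 * k for k in range(1, 41)]      # candidate values 0.05, ..., 2.0
h_cv = min(bandwidths, key=lambda h: cv_score(data, h))
```

The double loop costs $O(n^2)$ per bandwidth, which is fine for small samples; for large data sets a faster method such as the fast Fourier transform is preferable.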

We then choose the bandwidth $h_n$ that minimizes $\hat J(h)$.³ A justification for
this method is given by the following remarkable theorem due to Stone.

20.16 Theorem (Stone’s Theorem). Suppose that $f$ is bounded. Let $\hat f_h$ denote
the kernel estimator with bandwidth $h$ and let $h_n$ denote the bandwidth chosen
by cross-validation. Then,

$$\frac{\int \big( f(x) - \hat f_{h_n}(x) \big)^2 dx}{\inf_h \int \big( f(x) - \hat f_h(x) \big)^2 dx} \xrightarrow{\ P\ } 1. \qquad (20.26)$$

³For large data sets, $\hat f$ and (20.25) can be computed quickly using the fast Fourier transform.

20.17 Example. The top right panel of Figure 20.6 is based on cross-validation.
These data are rounded, which creates problems for cross-validation. Specifically, it
causes the minimizer to be $h = 0$. To overcome this problem, we added a
small amount of random Normal noise to the data. The result is that $\hat J(h)$ is
very smooth with a well defined minimum. ■

20.18 Remark. Do not assume that, if the estimator $\hat f$ is wiggly, then
cross-validation has let you down. The eye is not a good judge of risk.



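To construct confidence bands, we use something similar to histograms.
Again, the confidence band is for the smoothed version,

$$\bar f_n(x) = \mathbb{E}(\hat f_n(x)) = \int \frac{1}{h} K\left( \frac{x - u}{h} \right) f(u)\,du,$$

of the true density $f$.⁴ Assume the density is on an interval $(a, b)$. The band
is

$$\ell_n(x) = \hat f_n(x) - q\,\text{se}(x), \qquad u_n(x) = \hat f_n(x) + q\,\text{se}(x) \qquad (20.27)$$

where

$$\text{se}(x) = \frac{s(x)}{\sqrt{n}},$$

$$s^2(x) = \frac{1}{n-1} \sum_{i=1}^n \big( Y_i(x) - \bar Y_n(x) \big)^2,$$

$$Y_i(x) = \frac{1}{h} K\left( \frac{x - X_i}{h} \right),$$

$$q = \Phi^{-1}\left( \frac{1 + (1 - \alpha)^{1/m}}{2} \right),$$

$$m = \frac{b - a}{\omega}$$

where $\omega$ is the width of the kernel. In case the kernel does not have finite
width then we take $\omega$ to be the effective width, that is, the range over which
the kernel is non-negligible. In particular, we take $\omega = 3h$ for the Normal
kernel.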



20.19 Example. Figure 20.7 shows approximate 95 percent confidence bands
for the astronomy data. ■

⁴This is a modified version of the band described in Chaudhuri and Marron (1999).

FIGURE 20.7. 95 percent confidence bands for kernel density estimate for the
astronomy data.



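Suppose now that the data $X_i = (X_{i1}, \ldots, X_{id})$ are $d$-dimensional. The kernel
estimator can easily be generalized to $d$ dimensions. Let $h = (h_1, \ldots, h_d)$
be a vector of bandwidths and define

$$\hat f_n(x) = \frac{1}{n} \sum_{i=1}^n K_h(x - X_i) \qquad (20.28)$$

where

$$K_h(x - X_i) = \frac{1}{h_1 \cdots h_d} \prod_{j=1}^d K\left( \frac{x_j - X_{ij}}{h_j} \right).$$

The risk is given by

$$R \approx \frac{1}{4} \sigma_K^4 \left[ \sum_{j=1}^d h_j^4 \int f_{jj}^2(x)\,dx + \sum_{j \ne k} h_j^2 h_k^2 \int f_{jj}(x) f_{kk}(x)\,dx \right] + \frac{\left( \int K^2(x)\,dx \right)^d}{n h_1 \cdots h_d} \qquad (20.29)$$

where $f_{jj}$ is the second partial derivative of $f$. The optimal bandwidth satisfies
$h_i \approx c\, n^{-1/(4+d)}$ for some constant $c$, leading to a risk of order $n^{-4/(4+d)}$. From this fact, we see
that the risk increases quickly with dimension, a problem usually called the
curse of dimensionality. To get a sense of how serious this problem is,
consider the following table from Silverman (1986) which shows the sample
size required to ensure a relative mean squared error less than 0.1 at 0 when
the density is multivariate normal and the optimal bandwidth is selected: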

    Dimension    Sample Size
     1                     4
     2                    19
     3                    67
     4                   223
     5                   768
     6                 2,790
     7                10,700
     8                43,700
     9               187,000
    10               842,000

This is bad news indeed. It says that having 842,000 observations in a ten- 
dimensional problem is really like having 4 observations in a one-dimensional 
problem. 

20.4 Nonparametric Regression

Consider pairs of points $(x_1, Y_1), \ldots, (x_n, Y_n)$ related by

$$Y_i = r(x_i) + \epsilon_i \qquad (20.30)$$

where $\mathbb{E}(\epsilon_i) = 0$. We have written the $x_i$'s in lower case since we will treat
them as fixed. We can do this since, in regression, it is only the mean of $Y$
conditional on $x$ that we are interested in. We want to estimate the regression
function $r(x) = \mathbb{E}(Y \mid X = x)$.

There are many nonparametric regression estimators. Most involve
estimating $r(x)$ by taking some sort of weighted average of the $Y_i$'s, giving higher
weight to those points near $x$. A popular version is the Nadaraya-Watson
kernel estimator.

20.20 Definition. The Nadaraya-Watson kernel estimator is defined
by

$$\hat r(x) = \sum_{i=1}^n w_i(x) Y_i \qquad (20.31)$$

where $K$ is a kernel and the weights $w_i(x)$ are given by

$$w_i(x) = \frac{K\left( \frac{x - x_i}{h} \right)}{\sum_{j=1}^n K\left( \frac{x - x_j}{h} \right)}. \qquad (20.32)$$



The form of this estimator comes from first estimating the joint density
$f(x, y)$ using kernel density estimation and then inserting the estimate into
the formula,

$$r(x) = \mathbb{E}(Y \mid X = x) = \int y f(y \mid x)\,dy = \frac{\int y f(x, y)\,dy}{\int f(x, y)\,dy}.$$
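The Nadaraya-Watson estimator is a weighted average of the responses with kernel weights that sum to one. A minimal sketch with a Gaussian kernel follows; the design points, responses and bandwidth are made up, and the function names are illustrative.

```python
import math

def gaussian_kernel(u):
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

def nw_weights(x, xs, h, kernel=gaussian_kernel):
    """Kernel weights w_i(x); they are nonnegative and sum to one."""
    k = [kernel((x - xi) / h) for xi in xs]
    total = sum(k)
    return [ki / total for ki in k]

def nw_estimate(x, xs, ys, h, kernel=gaussian_kernel):
    """Nadaraya-Watson estimate at x: a locally weighted average of the responses."""
    return sum(w * y for w, y in zip(nw_weights(x, xs, h, kernel), ys))

xs = [0.0, 0.25, 0.5, 0.75, 1.0]      # made-up design points
ys = [0.1, 0.9, 2.1, 2.9, 4.2]        # made-up responses
fit_mid = nw_estimate(0.5, xs, ys, h=0.2)
```

Because the weights form a probability vector, the fitted value always lies between the smallest and largest response, and a constant response is reproduced exactly.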

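20.21 Theorem. Suppose that $\mathbb{V}(\epsilon_i) = \sigma^2$. The risk of the Nadaraya-Watson
kernel estimator is

$$R(\hat r_n, r) \approx \frac{h^4}{4} \left( \int x^2 K(x)\,dx \right)^2 \int \left( r''(x) + 2 r'(x) \frac{f'(x)}{f(x)} \right)^2 dx + \int \frac{\sigma^2 \int K^2(x)\,dx}{n h f(x)}\,dx. \qquad (20.33)$$

The optimal bandwidth decreases at rate $n^{-1/5}$ and with this choice the risk
decreases at rate $n^{-4/5}$.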


In practice, to choose the bandwidth $h$ we minimize the cross-validation
score

$$\hat J(h) = \sum_{i=1}^n \big( Y_i - \hat r_{-i}(x_i) \big)^2 \qquad (20.34)$$

where $\hat r_{-i}$ is the estimator we get by omitting the $i$th pair $(x_i, Y_i)$. Fortunately,
there is a shortcut formula for computing $\hat J$.



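20.22 Theorem. $\hat{J}$ can be written as

$$\hat{J}(h) = \sum_{i=1}^{n} \left(Y_i - \hat{r}(x_i)\right)^2 \frac{1}{\left(1 - \dfrac{K(0)}{\sum_{j=1}^{n} K\left(\frac{x_i - x_j}{h}\right)}\right)^2}. \tag{20.35}$$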
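
The estimator (20.31), the weights (20.32), and the two ways of computing the cross-validation score can be sketched in a few lines. This is only an illustration under stated assumptions: the Normal kernel, the simulated sine data, the bandwidth grid, and all function names are hypothetical choices, not part of the text.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Normal density, used here as the kernel K (an assumed choice)."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def nw_weights(x_eval, x, h):
    """Weights w_i(x) of (20.32); row a holds the weights for x_eval[a]."""
    k = gaussian_kernel((x_eval[:, None] - x[None, :]) / h)
    return k / k.sum(axis=1, keepdims=True)

def nw_fit(x_eval, x, y, h):
    """Nadaraya-Watson estimate (20.31) at the points x_eval."""
    return nw_weights(x_eval, x, h) @ y

def cv_score_shortcut(x, y, h):
    """Cross-validation score via the shortcut formula (20.35)."""
    k = gaussian_kernel((x[:, None] - x[None, :]) / h)
    row_sums = k.sum(axis=1)
    fitted = (k @ y) / row_sums
    ratio = gaussian_kernel(0.0) / row_sums      # K(0) / sum_j K((x_i - x_j)/h)
    return np.sum(((y - fitted) / (1.0 - ratio)) ** 2)

def cv_score_brute(x, y, h):
    """Cross-validation score from the definition (20.34): n leave-one-out refits."""
    n = len(x)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        pred = nw_fit(x[i:i + 1], x[keep], y[keep], h)[0]
        total += (y[i] - pred) ** 2
    return total

# Simulated data (hypothetical example, not the CMB data).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 80))
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.3, 80)

bandwidths = np.linspace(0.02, 0.3, 15)
scores = np.array([cv_score_shortcut(x, y, h) for h in bandwidths])
h_best = bandwidths[np.argmin(scores)]
```

The shortcut needs one kernel matrix per bandwidth, while the brute-force version refits the smoother n times; the two should agree up to rounding error.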



20.23 Example. Figure 20.8 shows cosmic microwave background (CMB)
data from BOOMERaNG (Netterfield et al. (2002)), Maxima (Lee et al.
(2001)), and DASI (Halverson et al. (2002)). The data consist of $n$ pairs
$(x_1, Y_1), \ldots, (x_n, Y_n)$ where $x_i$ is called the multipole moment and $Y_i$ is the
estimated power spectrum of the temperature fluctuations. What you are seeing
are sound waves in the cosmic microwave background radiation, which is
the heat left over from the big bang. If $r(x)$ denotes the true power spectrum,
then

$$Y_i = r(x_i) + \epsilon_i$$

where $\epsilon_i$ is a random error with mean 0. The location and size of peaks in
$r(x)$ provide valuable clues about the behavior of the early universe. Figure
20.8 shows the fit based on cross-validation as well as an undersmoothed and
oversmoothed fit. The cross-validation fit shows the presence of three well-
defined peaks, as predicted by the physics of the big bang. ■

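The procedure for finding confidence bands is similar to that for density
estimation. However, we first need to estimate $\sigma^2$. Suppose that the $x_i$'s are
ordered. Assuming $r(x)$ is smooth, we have $r(x_{i+1}) - r(x_i) \approx 0$ and hence

$$Y_{i+1} - Y_i = \left[r(x_{i+1}) + \epsilon_{i+1}\right] - \left[r(x_i) + \epsilon_i\right] \approx \epsilon_{i+1} - \epsilon_i$$

and hence

$$V(Y_{i+1} - Y_i) \approx V(\epsilon_{i+1} - \epsilon_i) = V(\epsilon_{i+1}) + V(\epsilon_i) = 2\sigma^2.$$

We can thus use the average of the $n - 1$ squared differences $(Y_{i+1} - Y_i)^2$ to
estimate $\sigma^2$. Hence, define

$$\hat{\sigma}^2 = \frac{1}{2(n-1)} \sum_{i=1}^{n-1} (Y_{i+1} - Y_i)^2. \tag{20.36}$$

As with density estimation, the confidence band is for the smoothed version
$\bar{r}_n(x) = E(\hat{r}_n(x))$ of the true regression function $r$.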
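
As a quick numerical check of (20.36), the sketch below simulates data with a known error variance and computes the difference-based estimate. The regression function, sample size, and the name `diff_variance` are assumptions made only for this illustration.

```python
import numpy as np

def diff_variance(x, y):
    """Difference-based estimate of sigma^2 from (20.36); sorts by x first."""
    order = np.argsort(x)
    d = np.diff(y[order])                  # Y_{i+1} - Y_i for the ordered data
    return np.sum(d ** 2) / (2.0 * (len(y) - 1))

# Simulated data with a smooth r(x) and sigma = 0.5 (hypothetical example).
rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(0.0, 1.0, n)
sigma = 0.5
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, sigma, n)

sigma2_hat = diff_variance(x, y)           # should be near sigma**2 = 0.25
```

The estimate relies on neighboring $x_i$'s being close, so that the differences of $r$ are negligible compared with the noise.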




FIGURE 20.8. Regression analysis of the CMB data. The first fit is undersmoothed,
the second is oversmoothed, and the third is based on cross-validation. The last
panel shows the estimated risk versus the bandwidth of the smoother. The data are
from BOOMERaNG, Maxima, and DASI.

Confidence Bands for Kernel Regression

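An approximate $1 - \alpha$ confidence band for $\bar{r}_n(x)$ is

$$\ell_n(x) = \hat{r}_n(x) - q\,\widehat{\text{se}}(x), \qquad u_n(x) = \hat{r}_n(x) + q\,\widehat{\text{se}}(x) \tag{20.37}$$

where

$$\widehat{\text{se}}(x) = \hat{\sigma}\sqrt{\sum_{i=1}^{n} w_i^2(x)}, \qquad q = \Phi^{-1}\left(\frac{1 + (1-\alpha)^{1/m}}{2}\right), \qquad m = \frac{b - a}{\omega},$$

$\hat{\sigma}$ is defined in (20.36) and $\omega$ is the width of the kernel. In case the kernel
does not have finite width then we take $\omega$ to be the effective width, that
is, the range over which the kernel is non-negligible. In particular, we take
$\omega = 3h$ for the Normal kernel.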

20.24 Example. Figure 20.9 shows a 95 percent confidence envelope for the
CMB data. We see that we are highly confident of the existence and position
of the first peak. We are more uncertain about the second and third peaks.
At the time of this writing, more accurate data are becoming available that
apparently provide sharper estimates of the second and third peaks. ■

The extension to multiple regressors $X = (X_1, \ldots, X_p)$ is straightforward.
As with kernel density estimation we just replace the kernel with a multivariate
kernel. However, the same caveats about the curse of dimensionality apply.
In some cases, we might consider putting some restrictions on the regression
function which will then reduce the curse of dimensionality. For example,
additive regression is based on the model

$$Y = \sum_{j=1}^{p} r_j(X_j) + \epsilon. \tag{20.38}$$

Now we only need to fit $p$ one-dimensional functions. The model can be
enriched by adding various interactions, for example,

$$Y = \sum_{j=1}^{p} r_j(X_j) + \sum_{j<k} r_{jk}(X_j X_k) + \epsilon. \tag{20.39}$$

Additive models are usually fit by an algorithm called backfitting.

FIGURE 20.9. 95 percent confidence envelope for the CMB data.



Backfitting

1. Initialize $r_1(x_1), \ldots, r_p(x_p)$.

2. For $j = 1, \ldots, p$:

(a) Let $\epsilon_i = Y_i - \sum_{s \neq j} r_s(x_i)$.

(b) Let $r_j$ be the function estimate obtained by regressing the $\epsilon_i$'s
on the $j$th covariate.

3. If converged STOP. Else, go back to step 2.
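
The backfitting steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the one-dimensional smoother (a Normal-kernel Nadaraya-Watson smoother), the fixed bandwidth, the separate intercept, and the centering of each fitted component are all assumptions added here for identifiability and convenience.

```python
import numpy as np

def smooth(x, target, h):
    """Nadaraya-Watson smoother with a Normal kernel, evaluated at the data points."""
    k = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (k @ target) / k.sum(axis=1)

def backfit(X, y, h=0.1, max_iter=50, tol=1e-6):
    """Fit an additive model; returns the intercept and the n x p matrix of r_j(x_ij)."""
    n, p = X.shape
    intercept = y.mean()                  # assumption: constant term handled separately
    fits = np.zeros((n, p))               # step 1: initialize every r_j at zero
    for _ in range(max_iter):
        previous = fits.copy()
        for j in range(p):                # step 2
            others = fits.sum(axis=1) - fits[:, j]
            partial = y - intercept - others            # (a) partial residuals
            fits[:, j] = smooth(X[:, j], partial, h)    # (b) regress on covariate j
            fits[:, j] -= fits[:, j].mean()             # centering (added convention)
        if np.max(np.abs(fits - previous)) < tol:       # step 3: convergence check
            break
    return intercept, fits

# Simulated additive data (hypothetical example).
rng = np.random.default_rng(2)
n = 300
X = rng.uniform(0.0, 1.0, size=(n, 2))
y = np.sin(2.0 * np.pi * X[:, 0]) + 4.0 * (X[:, 1] - 0.5) ** 2 + rng.normal(0.0, 0.2, n)

intercept, fits = backfit(X, y)
residual_mse = np.mean((y - intercept - fits.sum(axis=1)) ** 2)
```

In practice the bandwidth of each component would be chosen by cross-validation rather than fixed in advance.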

Additive models have the advantage that they avoid the curse of dimensionality
and they can be fit quickly, but they have one disadvantage: the model
is not fully nonparametric. In other words, the true regression function $r(x)$
may not be of the form (20.38).



20.5 Appendix 

Confidence Sets and Bias. The confidence bands we computed are not
for the density function or regression function but rather for the smoothed
function. For example, the confidence band for a kernel density estimate with
bandwidth $h$ is a band for the function one gets by smoothing the true function
with a kernel with the same bandwidth. Getting a confidence set for the true
function is complicated for reasons we now explain.

Let $\hat{f}_n(x)$ denote an estimate of the function $f(x)$. Denote the mean and
standard deviation of $\hat{f}_n(x)$ by $\bar{f}_n(x)$ and $s_n(x)$. Then,

$$\frac{\hat{f}_n(x) - f(x)}{s_n(x)} = \frac{\hat{f}_n(x) - \bar{f}_n(x)}{s_n(x)} + \frac{\bar{f}_n(x) - f(x)}{s_n(x)}.$$

Typically, the first term converges to a standard Normal from which one
derives confidence bands. The second term is the bias divided by the standard
deviation. In parametric inference, the bias is usually smaller than the standard
deviation of the estimator so this term goes to 0 as the sample size
increases. In nonparametric inference, optimal smoothing leads us to balance
the bias and the standard deviation. Thus the second term does not vanish
even with large sample sizes. This means that the confidence interval will not
be centered around the true function $f$.

20.6 Bibliographic Remarks 

Two very good books on density estimation are Scott (1992) and Silverman
(1986). The literature on nonparametric regression is very large. Two good
starting points are Härdle (1990) and Loader (1999). The latter emphasizes a
class of techniques called local likelihood methods.



20.7 Exercises 



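1. Let $X_1, \ldots, X_n \sim f$ and let $\hat{f}_n$ be the kernel density estimator using the
boxcar kernel:

$$K(x) = \begin{cases} 1 & -\frac{1}{2} < x < \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}$$

(a) Show that

$$E(\hat{f}(x)) = \frac{1}{h} \int_{x-(h/2)}^{x+(h/2)} f(y)\,dy$$

and

$$V(\hat{f}(x)) = \frac{1}{nh^2}\left[\int_{x-(h/2)}^{x+(h/2)} f(y)\,dy - \left(\int_{x-(h/2)}^{x+(h/2)} f(y)\,dy\right)^2\right].$$

(b) Show that if $h \to 0$ and $nh \to \infty$ as $n \to \infty$, then $\hat{f}_n(x) \stackrel{P}{\longrightarrow} f(x)$.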

2. Get the data on fragments of glass collected in forensic work from the
book website. Estimate the density of the first variable (refractive index)
using a histogram and a kernel density estimator. Use cross-validation
to choose the amount of smoothing. Experiment with different
binwidths and bandwidths. Comment on the similarities and differences.
Construct 95 percent confidence bands for your estimators.

3. Consider the data from question 2. Let $Y$ be refractive index and let
$x$ be aluminum content (the fourth variable). Perform a nonparametric
regression to fit the model $Y = f(x) + \epsilon$. Use cross-validation to estimate
the bandwidth. Construct 95 percent confidence bands for your estimate.

4. Prove Lemma 20.1. 

5. Prove Theorem 20.3. 

6. Prove Theorem 20.7. 

7. Prove Theorem 20.15. 

8. Consider regression data $(x_1, Y_1), \ldots, (x_n, Y_n)$. Suppose that $0 \le x_i \le 1$
for all $i$. Define bins $B_j$ as in equation (20.7). For $x \in B_j$ define

$$\hat{f}_n(x) = \bar{Y}_j$$

where $\bar{Y}_j$ is the mean of all the $Y_i$'s corresponding to those $x_i$'s in $B_j$.
Find the approximate risk of this estimator. From this expression for
the risk, find the optimal bandwidth. At what rate does the risk go to
zero?

9. Show that with suitable smoothness assumptions on $r(x)$, $\hat{\sigma}^2$ in equation
(20.36) is a consistent estimator of $\sigma^2$.

10. Prove Theorem 20.22.

21

Smoothing Using Orthogonal Functions



In this chapter we will study an approach to nonparametric curve estima- 
tion based on orthogonal functions. We begin with a brief introduction to 
the theory of orthogonal functions, then we turn to density estimation and 
regression. 



21.1 Orthogonal Functions and $L_2$ Spaces

Let $v = (v_1, v_2, v_3)$ denote a three-dimensional vector, that is, a list of three
real numbers. Let $\mathcal{V}$ denote the set of all such vectors. If $a$ is a scalar (a
number) and $v$ is a vector, we define $av = (av_1, av_2, av_3)$. The sum of vectors
$v$ and $w$ is defined by $v + w = (v_1 + w_1, v_2 + w_2, v_3 + w_3)$. The inner product
between two vectors $v$ and $w$ is defined by $\langle v, w \rangle = \sum_{i=1}^{3} v_i w_i$. The norm
(or length) of a vector $v$ is defined by

$$\|v\| = \sqrt{\langle v, v \rangle} = \sqrt{\sum_{i=1}^{3} v_i^2}. \tag{21.1}$$

Two vectors are orthogonal (or perpendicular) if $\langle v, w \rangle = 0$. A set of
vectors are orthogonal if each pair in the set is orthogonal. A vector is normal
if $\|v\| = 1$.

Let $\phi_1 = (1, 0, 0)$, $\phi_2 = (0, 1, 0)$, $\phi_3 = (0, 0, 1)$. These vectors are said to be
an orthonormal basis for $\mathcal{V}$ since they have the following properties:

(i) they are orthogonal;

(ii) they are normal;

(iii) they form a basis for $\mathcal{V}$, which means that any $v \in \mathcal{V}$ can be written as a
linear combination of $\phi_1, \phi_2, \phi_3$:

$$v = \sum_{j=1}^{3} \beta_j \phi_j \quad \text{where } \beta_j = \langle \phi_j, v \rangle. \tag{21.2}$$

For example, if $v = (12, 3, 4)$ then $v = 12\phi_1 + 3\phi_2 + 4\phi_3$. There are other
orthonormal bases for $\mathcal{V}$, for example,

$$\psi_1 = \left(\frac{1}{\sqrt{3}}, \frac{1}{\sqrt{3}}, \frac{1}{\sqrt{3}}\right), \quad \psi_2 = \left(\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}, 0\right), \quad \psi_3 = \left(\frac{1}{\sqrt{6}}, \frac{1}{\sqrt{6}}, -\frac{2}{\sqrt{6}}\right).$$

You can check that these three vectors also form an orthonormal basis for $\mathcal{V}$.
Again, if $v$ is any vector then we can write

$$v = \sum_{j=1}^{3} \beta_j \psi_j \quad \text{where } \beta_j = \langle \psi_j, v \rangle.$$

For example, if $v = (12, 3, 4)$ then

$$v = 10.97\psi_1 + 6.36\psi_2 + 2.86\psi_3.$$
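
The claims about the second basis are easy to check numerically. The short sketch below (array names are illustrative only) verifies orthonormality, recomputes the coefficients $\beta_j = \langle \psi_j, v \rangle$ for $v = (12, 3, 4)$, and rebuilds $v$ from them.

```python
import numpy as np

# Rows are psi_1, psi_2, psi_3 from the example above.
psi = np.array([
    [1 / np.sqrt(3), 1 / np.sqrt(3), 1 / np.sqrt(3)],
    [1 / np.sqrt(2), -1 / np.sqrt(2), 0.0],
    [1 / np.sqrt(6), 1 / np.sqrt(6), -2 / np.sqrt(6)],
])
v = np.array([12.0, 3.0, 4.0])

gram = psi @ psi.T            # inner products <psi_i, psi_j>; identity if orthonormal
beta = psi @ v                # beta_j = <psi_j, v>
reconstructed = beta @ psi    # sum_j beta_j psi_j
```

Rounded to two decimals, `beta` reproduces the coefficients 10.97, 6.36, and 2.86 quoted in the text, and the squared norms of `v` and `beta` coincide.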



Now we make the leap from vectors to functions. Basically, we just replace
vectors with functions and sums with integrals. Let $L_2(a, b)$ denote all functions
defined on the interval $[a, b]$ such that $\int_a^b f(x)^2\,dx < \infty$:

$$L_2(a, b) = \left\{ f : [a, b] \to \mathbb{R},\ \int_a^b f(x)^2\,dx < \infty \right\}. \tag{21.3}$$

We sometimes write $L_2$ instead of $L_2(a, b)$. The inner product between two
functions $f, g \in L_2$ is defined by $\int f(x) g(x)\,dx$. The norm of $f$ is

$$\|f\| = \sqrt{\int f(x)^2\,dx}. \tag{21.4}$$

Two functions are orthogonal if $\int f(x) g(x)\,dx = 0$. A function is normal if
$\|f\| = 1$.

A sequence of functions $\phi_1, \phi_2, \phi_3, \ldots$ is orthonormal if $\int \phi_j^2(x)\,dx = 1$ for
each $j$ and $\int \phi_i(x)\phi_j(x)\,dx = 0$ for $i \neq j$. An orthonormal sequence is
complete if the only function that is orthogonal to each $\phi_j$ is the zero function.
In this case, the functions $\phi_1, \phi_2, \phi_3, \ldots$ form a basis, meaning that if $f \in L_2$
then $f$ can be written as$^1$

$$f(x) = \sum_{j=1}^{\infty} \beta_j \phi_j(x), \quad \text{where } \beta_j = \int_a^b f(x)\phi_j(x)\,dx. \tag{21.5}$$

A useful result is Parseval's relation which says that

$$\|f\|^2 \equiv \int f^2(x)\,dx = \sum_{j=1}^{\infty} \beta_j^2 \equiv \|\beta\|^2 \tag{21.6}$$

where $\beta = (\beta_1, \beta_2, \ldots)$.

21.1 Example. An example of an orthonormal basis for $L_2(0, 1)$ is the cosine
basis defined as follows. Let $\phi_0(x) = 1$ and for $j \geq 1$ define

$$\phi_j(x) = \sqrt{2}\cos(j\pi x). \tag{21.7}$$

The first six functions are plotted in Figure 21.1. ■

21.2 Example. Let

$$f(x) = \sqrt{x(1-x)}\,\sin\left(\frac{2.1\pi}{x + .05}\right)$$

which is called the "doppler function." Figure 21.2 shows $f$ (top left) and its
approximation

$$f_J(x) = \sum_{j=1}^{J} \beta_j \phi_j(x)$$

with $J$ equal to 5 (top right), 20 (bottom left), and 200 (bottom right).
As $J$ increases we see that $f_J(x)$ gets closer to $f(x)$. The coefficients $\beta_j =
\int_0^1 f(x)\phi_j(x)\,dx$ were computed numerically. ■
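
A rough way to reproduce this kind of approximation is sketched below. The coefficients are computed with a midpoint rule on a fine grid; the grid size, the function names, and the decision to include the constant term $\phi_0$ in the partial sum are assumptions of this sketch rather than details given in the text.

```python
import numpy as np

def cosine_basis(j, x):
    """phi_0(x) = 1 and phi_j(x) = sqrt(2) cos(j pi x) for j >= 1, as in (21.7)."""
    x = np.asarray(x, dtype=float)
    return np.ones_like(x) if j == 0 else np.sqrt(2.0) * np.cos(j * np.pi * x)

def doppler(x):
    """The doppler function of Example 21.2."""
    return np.sqrt(x * (1.0 - x)) * np.sin(2.1 * np.pi / (x + 0.05))

N = 20000
grid = (np.arange(N) + 0.5) / N            # midpoints of N equal cells in [0, 1]

def coefficient(f, j):
    """Midpoint-rule approximation to beta_j = int_0^1 f(x) phi_j(x) dx."""
    return np.mean(f(grid) * cosine_basis(j, grid))

def partial_sum(f, J, x):
    """Sum of beta_j phi_j(x) over j = 0, ..., J (constant term included by assumption)."""
    return sum(coefficient(f, j) * cosine_basis(j, x) for j in range(J + 1))

# Integrated squared error of the approximation for the three values of J in the example.
errors = {J: np.mean((doppler(grid) - partial_sum(doppler, J, grid)) ** 2)
          for J in (5, 20, 200)}
```

The integrated squared error shrinks as $J$ grows, which is the numerical counterpart of the panels in Figure 21.2.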

$^1$The equality in the displayed equation means that $\int (f(x) - f_n(x))^2\,dx \to 0$ where $f_n(x) =
\sum_{j=1}^{n} \beta_j \phi_j(x)$.

FIGURE 21.1. The first six functions in the cosine basis.

FIGURE 21.2. Approximating the doppler function with its expansion
in the cosine basis. The function $f$ (top left) and its approximation
$f_J(x) = \sum_{j=1}^{J} \beta_j \phi_j(x)$ with $J$ equal to 5 (top right), 20 (bottom left),
and 200 (bottom right). The coefficients $\beta_j = \int_0^1 f(x)\phi_j(x)\,dx$ were
computed numerically.

21.3 Example. The Legendre polynomials on $[-1, 1]$ are defined by

$$P_j(x) = \frac{1}{2^j j!} \frac{d^j}{dx^j}(x^2 - 1)^j, \quad j = 0, 1, 2, \ldots \tag{21.8}$$

It can be shown that these functions are complete and orthogonal and that

$$\int_{-1}^{1} P_j^2(x)\,dx = \frac{2}{2j + 1}. \tag{21.9}$$

It follows that the functions $\phi_j(x) = \sqrt{(2j + 1)/2}\,P_j(x)$, $j = 0, 1, \ldots$ form an
orthonormal basis for $L_2(-1, 1)$. The first few Legendre polynomials are:

$$P_0(x) = 1, \quad P_1(x) = x, \quad P_2(x) = \frac{1}{2}\left(3x^2 - 1\right), \quad \text{and} \quad P_3(x) = \frac{1}{2}\left(5x^3 - 3x\right).$$

These polynomials may be constructed explicitly using the following recursive
relation:

$$P_{j+1}(x) = \frac{(2j + 1)\,x\,P_j(x) - j\,P_{j-1}(x)}{j + 1}. \tag{21.10}$$
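
The recursion (21.10) translates directly into code. The sketch below builds $P_j$ from $P_0 = 1$ and $P_1 = x$ and checks the orthonormality of the rescaled functions with Gauss-Legendre quadrature; the function names are illustrative, and `numpy.polynomial.legendre` is used only as an independent check.

```python
import numpy as np
from numpy.polynomial import legendre as npleg

def legendre(j, x):
    """P_j(x) built from P_0 = 1, P_1 = x and the recursion (21.10)."""
    x = np.asarray(x, dtype=float)
    prev, curr = np.ones_like(x), x.copy()
    if j == 0:
        return prev
    for m in range(1, j):
        # P_{m+1} = ((2m + 1) x P_m - m P_{m-1}) / (m + 1)
        prev, curr = curr, ((2 * m + 1) * x * curr - m * prev) / (m + 1)
    return curr

def phi(j, x):
    """Normalized Legendre function sqrt((2j + 1)/2) P_j(x)."""
    return np.sqrt((2 * j + 1) / 2.0) * legendre(j, x)

# 20-point Gauss-Legendre rule: exact for polynomials of degree up to 39.
nodes, weights = npleg.leggauss(20)
gram = np.array([[np.sum(weights * phi(i, nodes) * phi(j, nodes))
                  for j in range(6)] for i in range(6)])
```

If the normalization in the text is right, `gram` should be the 6 x 6 identity matrix up to rounding error.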



The coefficients $\beta_1, \beta_2, \ldots$ are related to the smoothness of the function $f$.
To see why, note that if $f$ is smooth, then its derivatives will be finite. Thus we
expect that, for some $k$, $\int_0^1 (f^{(k)}(x))^2\,dx < \infty$ where $f^{(k)}$ is the $k$th derivative
of $f$. Now consider the cosine basis (21.7) and let $f(x) = \sum_{j=0}^{\infty} \beta_j \phi_j(x)$. Then,

$$\int_0^1 (f^{(k)}(x))^2\,dx = \sum_{j=1}^{\infty} \beta_j^2 (\pi j)^{2k}.$$

The only way that $\sum_{j=1}^{\infty} \beta_j^2 (\pi j)^{2k}$ can be finite is if the $\beta_j$'s get small when
$j$ gets large. To summarize:

If the function $f$ is smooth, then the coefficients $\beta_j$ will be small
when $j$ is large.

For the rest of this chapter, assume we are using the cosine basis unless
otherwise specified.



21.2 Density Estimation 

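Let $X_1, \ldots, X_n$ be IID observations from a distribution on $[0, 1]$ with density
$f$. Assuming $f \in L_2$ we can write

$$f(x) = \sum_{j=0}^{\infty} \beta_j \phi_j(x)$$

where $\phi_1, \phi_2, \ldots$ is an orthonormal basis. Define

$$\hat{\beta}_j = \frac{1}{n} \sum_{i=1}^{n} \phi_j(X_i). \tag{21.11}$$

21.4 Theorem. The mean and variance of $\hat{\beta}_j$ are

$$E(\hat{\beta}_j) = \beta_j, \qquad V(\hat{\beta}_j) = \frac{\sigma_j^2}{n} \tag{21.12}$$

where

$$\sigma_j^2 = V(\phi_j(X_i)) = \int (\phi_j(x) - \beta_j)^2 f(x)\,dx. \tag{21.13}$$

Proof. The mean is

$$E(\hat{\beta}_j) = \frac{1}{n} \sum_{i=1}^{n} E(\phi_j(X_i)) = E(\phi_j(X_1)) = \int \phi_j(x) f(x)\,dx = \beta_j.$$

The calculation for the variance is similar. ■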

Hence, $\hat{\beta}_j$ is an unbiased estimate of $\beta_j$. It is tempting to estimate $f$ by
$\sum_{j=1}^{\infty} \hat{\beta}_j \phi_j(x)$ but this turns out to have a very high variance. Instead, consider
the estimator

$$\hat{f}(x) = \sum_{j=1}^{J} \hat{\beta}_j \phi_j(x). \tag{21.14}$$

The number of terms $J$ is a smoothing parameter. Increasing $J$ will decrease
bias while increasing variance. For technical reasons, we restrict $J$ to lie in
the range

$$1 \leq J \leq p$$

where $p = p(n) = \sqrt{n}$. To emphasize the dependence of the risk function on
$J$, we write the risk function as $R(J)$.



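21.5 Theorem. The risk of $\hat{f}$ is

$$R(J) = \sum_{j=1}^{J} \frac{\sigma_j^2}{n} + \sum_{j=J+1}^{\infty} \beta_j^2. \tag{21.15}$$

An estimate of the risk is

$$\hat{R}(J) = \sum_{j=1}^{J} \frac{\hat{\sigma}_j^2}{n} + \sum_{j=J+1}^{p} \left(\hat{\beta}_j^2 - \frac{\hat{\sigma}_j^2}{n}\right)_+ \tag{21.16}$$

where $a_+ = \max\{a, 0\}$ and

$$\hat{\sigma}_j^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(\phi_j(X_i) - \hat{\beta}_j\right)^2. \tag{21.17}$$

To motivate this estimator, note that $\hat{\sigma}_j^2$ is an unbiased estimate of $\sigma_j^2$ and
$\hat{\beta}_j^2 - \hat{\sigma}_j^2/n$ is an unbiased estimator of $\beta_j^2$. We take the positive part of the latter
term since we know that $\beta_j^2$ cannot be negative. We now choose $1 \leq J \leq p$ to
minimize $\hat{R}(J)$. Here is a summary:

Summary of Orthogonal Function Density Estimation

1. Let

$$\hat{\beta}_j = \frac{1}{n} \sum_{i=1}^{n} \phi_j(X_i).$$

2. Choose $\hat{J}$ to minimize $\hat{R}(J)$ over $1 \leq J \leq p = \sqrt{n}$ where $\hat{R}$ is given in
equation (21.16).

3. Let

$$\hat{f}(x) = \sum_{j=1}^{\hat{J}} \hat{\beta}_j \phi_j(x).$$

The estimator $\hat{f}_n$ can be negative. If we are interested in exploring the
shape of $f$, this is not a problem. However, if we need our estimate to be a
probability density function, we can truncate the estimate and then normalize
it. That is, we take $\hat{f}^* = \max\{\hat{f}_n(x), 0\} / \int_0^1 \max\{\hat{f}_n(u), 0\}\,du$.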
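
The summary above, together with the truncate-and-normalize step, can be sketched as follows. Several choices here are assumptions made for the illustration: the data are simulated from a Beta(2, 5) density, the function names are hypothetical, and the constant term $\phi_0(x) = 1$ of the cosine basis is added with coefficient 1, which is its exact value for any density on $[0, 1]$.

```python
import numpy as np

def cosine_terms(x, n_terms):
    """Matrix with columns phi_1(x), ..., phi_{n_terms}(x), phi_j(x) = sqrt(2) cos(j pi x)."""
    j = np.arange(1, n_terms + 1)
    return np.sqrt(2.0) * np.cos(np.pi * np.outer(x, j))

def series_density(data):
    """Orthogonal series density estimate with J chosen by the risk estimate (21.16)."""
    n = len(data)
    p = int(np.sqrt(n))
    basis = cosine_terms(data, p)
    beta_hat = basis.mean(axis=0)                    # (21.11)
    var_hat = basis.var(axis=0, ddof=1)              # (21.17)
    excess = np.maximum(beta_hat ** 2 - var_hat / n, 0.0)
    risk = np.array([var_hat[:J].sum() / n + excess[J:].sum()
                     for J in range(1, p + 1)])      # (21.16) for J = 1, ..., p
    J_hat = 1 + int(np.argmin(risk))

    def f_hat(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        # Constant term phi_0 = 1 has coefficient exactly 1 for a density on [0, 1].
        return 1.0 + cosine_terms(x, J_hat) @ beta_hat[:J_hat]

    return f_hat, J_hat, risk

# Simulated sample (hypothetical example).
rng = np.random.default_rng(3)
sample = rng.beta(2.0, 5.0, size=1000)
f_hat, J_hat, risk = series_density(sample)

grid = (np.arange(2000) + 0.5) / 2000
raw = f_hat(grid)
positive = np.maximum(raw, 0.0)
f_star = positive / positive.mean()    # grid mean approximates the integral over [0, 1]
```

The truncated estimate `f_star` is nonnegative and integrates to one on the grid, at the price of a small distortion wherever `raw` dips below zero.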

Now let us construct a confidence band for $f$. Suppose we estimate $f$ using
$J$ orthogonal functions. We are essentially estimating $f_J(x) = \sum_{j=1}^{J} \beta_j \phi_j(x)$,
not the true density $f(x) = \sum_{j=1}^{\infty} \beta_j \phi_j(x)$. Thus, the confidence band should
be regarded as a band for $f_J(x)$.
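$\epsilon_j \sim N(0, 1)$, and therefore

$$L \approx \frac{1}{n} \sum_{j=1}^{J} \sigma_j^2 \epsilon_j^2 \leq \frac{K^2}{n} \sum_{j=1}^{J} \epsilon_j^2 \stackrel{d}{=} \frac{K^2}{n} \chi_J^2.$$

Thus we have, approximately, that

$$P\left(L > \frac{K^2}{n} \chi_{J,\alpha}^2\right) \leq P\left(\frac{K^2}{n} \chi_J^2 > \frac{K^2}{n} \chi_{J,\alpha}^2\right) = \alpha. \tag{21.20}$$

Also,

$$\max_x |\hat{f}_J(x) - f_J(x)| \leq \max_x \sum_{j=1}^{J} |\phi_j(x)|\,|\hat{\beta}_j - \beta_j| \leq K \sum_{j=1}^{J} |\hat{\beta}_j - \beta_j| \leq \sqrt{J}\,K \sqrt{\sum_{j=1}^{J} (\hat{\beta}_j - \beta_j)^2} = \sqrt{J}\,K \sqrt{L}$$

where the third inequality is from the Cauchy-Schwarz inequality (Theorem
4.8). So,

$$P\left(\max_x |\hat{f}_J(x) - f_J(x)| > K^2 \sqrt{\frac{J \chi_{J,\alpha}^2}{n}}\right) \leq P\left(\sqrt{J}\,K\sqrt{L} > K^2 \sqrt{\frac{J \chi_{J,\alpha}^2}{n}}\right) = P\left(L > \frac{K^2 \chi_{J,\alpha}^2}{n}\right) \leq \alpha.$$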



21.7 Example. Let

$$f(x) = \frac{1}{2}\phi(x; 0, 1) + \frac{1}{10} \sum_{j=1}^{5} \phi(x; \mu_j, .1)$$

where $\phi(x; \mu, \sigma)$ denotes a Normal density with mean $\mu$ and standard deviation
$\sigma$, and $(\mu_1, \ldots, \mu_5) = (-1, -1/2, 0, 1/2, 1)$. Marron and Wand (1992) call this
"the claw" although the "Bart Simpson" might be more appropriate. Figure
21.3 shows the true density as well as the estimated density based on $n =
5{,}000$ observations and a 95 percent confidence band. The density has been
rescaled to have most of its mass between 0 and 1 using the transformation
$y = (x + 3)/6$. ■

FIGURE 21.3. The top plot is the true density for the Bart Simpson distribution
(rescaled to have most of its mass between 0 and 1). The bottom plot is the
orthogonal function density estimate and 95 percent confidence band.



21.3 Regression 



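Consider the regression model

$$Y_i = r(x_i) + \epsilon_i, \quad i = 1, \ldots, n \tag{21.21}$$

where the $\epsilon_i$ are independent with mean 0 and variance $\sigma^2$. We will initially
focus on the special case where $x_i = i/n$. We assume that $r \in L_2(0, 1)$ and
hence we can write

$$r(x) = \sum_{j=1}^{\infty} \beta_j \phi_j(x) \quad \text{where } \beta_j = \int_0^1 r(x)\phi_j(x)\,dx \tag{21.22}$$

where $\phi_1, \phi_2, \ldots$ is an orthonormal basis for $[0, 1]$.

Define

$$\hat{\beta}_j = \frac{1}{n} \sum_{i=1}^{n} Y_i\,\phi_j(x_i), \quad j = 1, 2, \ldots \tag{21.23}$$

Since $\hat{\beta}_j$ is an average, the central limit theorem tells us that $\hat{\beta}_j$ will be
approximately Normally distributed.

21.8 Theorem.

$$\hat{\beta}_j \approx N\left(\beta_j, \frac{\sigma^2}{n}\right). \tag{21.24}$$

Proof. The mean of $\hat{\beta}_j$ is

$$E(\hat{\beta}_j) = \frac{1}{n} \sum_{i=1}^{n} E(Y_i)\phi_j(x_i) = \frac{1}{n} \sum_{i=1}^{n} r(x_i)\phi_j(x_i) \approx \int r(x)\phi_j(x)\,dx = \beta_j$$

where the approximate equality holds because the average over the grid $x_i = i/n$
is a Riemann sum for the integral. Similarly, the variance is

$$V(\hat{\beta}_j) = \frac{1}{n^2} \sum_{i=1}^{n} V(Y_i)\phi_j^2(x_i) = \frac{\sigma^2}{n} \cdot \frac{1}{n} \sum_{i=1}^{n} \phi_j^2(x_i) \approx \frac{\sigma^2}{n} \int \phi_j^2(x)\,dx = \frac{\sigma^2}{n}. \quad ■$$

Let

$$\hat{r}_n(x) = \sum_{j=1}^{J} \hat{\beta}_j \phi_j(x)$$

and let

$$R(J) = E \int \left(\hat{r}_n(x) - r(x)\right)^2 dx$$

be the risk of the estimator.

21.9 Theorem. The risk $R(J)$ of the estimator $\hat{r}_n(x) = \sum_{j=1}^{J} \hat{\beta}_j \phi_j(x)$ is

$$R(J) = \frac{J\sigma^2}{n} + \sum_{j=J+1}^{\infty} \beta_j^2. \tag{21.25}$$

To estimate $\sigma^2 = V(\epsilon_i)$ we use

$$\hat{\sigma}^2 = \frac{n}{k} \sum_{i=n-k+1}^{n} \hat{\beta}_i^2 \tag{21.26}$$

where $k = n/4$. To motivate this estimator, recall that if $f$ is smooth, then
$\beta_j \approx 0$ for large $j$. So, for $j \geq k$, $\hat{\beta}_j \approx N(0, \sigma^2/n)$ and thus, $\hat{\beta}_j \approx \sigma Z_j/\sqrt{n}$
for $j \geq k$, where $Z_j \sim N(0, 1)$. Therefore,

$$\hat{\sigma}^2 = \frac{n}{k} \sum_{i=n-k+1}^{n} \hat{\beta}_i^2 \approx \frac{n}{k} \sum_{i=n-k+1}^{n} \left(\frac{\sigma}{\sqrt{n}} Z_i\right)^2 = \frac{\sigma^2}{k} \sum_{i=n-k+1}^{n} Z_i^2 = \frac{\sigma^2}{k} \chi_k^2$$

since a sum of $k$ squared independent standard Normals has a $\chi_k^2$ distribution.
Now $E(\chi_k^2) = k$ and hence $E(\hat{\sigma}^2) \approx \sigma^2$. Also, $V(\chi_k^2) = 2k$ and hence
$V(\hat{\sigma}^2) \approx (\sigma^4/k^2)(2k) = 2\sigma^4/k \to 0$
as $n \to \infty$. Thus we expect $\hat{\sigma}^2$ to be a consistent estimator of $\sigma^2$. There is
nothing special about the choice $k = n/4$. Any $k$ that increases with $n$ at an
appropriate rate will suffice.

We estimate the risk with

$$\hat{R}(J) = J\frac{\hat{\sigma}^2}{n} + \sum_{j=J+1}^{n} \left(\hat{\beta}_j^2 - \frac{\hat{\sigma}^2}{n}\right)_+. \tag{21.27}$$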



21.10 Example. Figure 21.4 shows the doppler function $f$ and $n = 2{,}048$
observations generated from the model

$$Y_i = r(x_i) + \epsilon_i$$

where $x_i = i/n$, $\epsilon_i \sim N(0, (.1)^2)$. The figure shows the data and the estimated
function. The estimate was based on $J = 234$ terms. ■

We are now ready to give a complete description of the method. 



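Orthogonal Series Regression Estimator

1. Let

$$\hat{\beta}_j = \frac{1}{n} \sum_{i=1}^{n} Y_i\,\phi_j(x_i), \quad j = 1, \ldots, n.$$

2. Let

$$\hat{\sigma}^2 = \frac{n}{k} \sum_{i=n-k+1}^{n} \hat{\beta}_i^2 \tag{21.28}$$

where $k \approx n/4$.

3. For $1 \leq J \leq n$, compute the risk estimate $\hat{R}(J)$ given in (21.27).

4. Choose $\hat{J}$ to minimize $\hat{R}(J)$.

5. Let

$$\hat{r}(x) = \sum_{j=1}^{\hat{J}} \hat{\beta}_j \phi_j(x).$$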
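
The estimator can be sketched in code as follows, using simulated doppler data on the grid $x_i = i/n$. The indexing is an assumption of this sketch: the basis is taken to be the cosine basis with the constant function listed first, so that column $j$ of the design matrix plays the role of $\phi_j$; the sample size and function names are likewise illustrative.

```python
import numpy as np

def doppler(x):
    """The doppler function of Example 21.2."""
    return np.sqrt(x * (1.0 - x)) * np.sin(2.1 * np.pi / (x + 0.05))

def cosine_design(x, n_terms):
    """Columns are 1, sqrt(2)cos(pi x), sqrt(2)cos(2 pi x), ... evaluated at x."""
    freqs = np.arange(n_terms)
    basis = np.sqrt(2.0) * np.cos(np.pi * x[:, None] * freqs[None, :])
    basis[:, 0] = 1.0
    return basis

def series_regression(y):
    """Orthogonal series regression on the grid x_i = i/n."""
    n = len(y)
    x = np.arange(1, n + 1) / n
    basis = cosine_design(x, n)
    beta_hat = basis.T @ y / n                                 # (21.23)
    k = n // 4
    sigma2_hat = (n / k) * np.sum(beta_hat[n - k:] ** 2)       # (21.28)
    excess = np.maximum(beta_hat ** 2 - sigma2_hat / n, 0.0)
    # tail[J - 1] = sum of excess_j over j > J (j counted from 1)
    tail = np.concatenate([np.cumsum(excess[::-1])[::-1][1:], [0.0]])
    J_values = np.arange(1, n + 1)
    risk = J_values * sigma2_hat / n + tail                    # (21.27)
    J_hat = int(J_values[np.argmin(risk)])                     # minimize the risk estimate
    fitted = basis[:, :J_hat] @ beta_hat[:J_hat]               # estimate at the x_i
    return fitted, J_hat, np.sqrt(sigma2_hat), risk

# Simulated data in the spirit of Example 21.10, with a smaller n (hypothetical choice).
rng = np.random.default_rng(4)
n = 1024
x = np.arange(1, n + 1) / n
truth = doppler(x)
y = truth + rng.normal(0.0, 0.1, n)

fitted, J_hat, sigma_hat, risk = series_regression(y)
mse = np.mean((fitted - truth) ** 2)
```

Building the full n x n design matrix is wasteful for large n; a fast cosine transform would be the natural replacement, at the cost of a less transparent sketch.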





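Finally, we turn to confidence bands. As before, these bands are not really
for the true function $r(x)$ but rather for the smoothed version of the function
$r_J(x) = \sum_{j=1}^{J} \beta_j \phi_j(x)$.

21.11 Theorem. Suppose the estimate $\hat{r}$ is based on $J$ terms and $\hat{\sigma}$ is defined
as in equation (21.28). Assume that $J < n - k + 1$. An approximate $1 - \alpha$
confidence band for $r_J$ is $(\ell, u)$ where

$$\ell(x) = \hat{r}_n(x) - c, \qquad u(x) = \hat{r}_n(x) + c, \tag{21.29}$$

where

$$c = \frac{a(x)\,\hat{\sigma}\,\chi_{J,\alpha}}{\sqrt{n}}, \qquad a(x) = \sqrt{\sum_{j=1}^{J} \phi_j^2(x)},$$

and $\hat{\sigma}$ is given in equation (21.28).

Proof. Let $L = \sum_{j=1}^{J} (\hat{\beta}_j - \beta_j)^2$. By the central limit theorem, $\hat{\beta}_j \approx
N(\beta_j, \sigma^2/n)$. Hence, $\hat{\beta}_j \approx \beta_j + \sigma\epsilon_j/\sqrt{n}$ where $\epsilon_j \sim N(0, 1)$ and therefore

$$L \approx \frac{\sigma^2}{n} \sum_{j=1}^{J} \epsilon_j^2 \stackrel{d}{=} \frac{\sigma^2}{n} \chi_J^2.$$

Thus,

$$P\left(L > \frac{\sigma^2}{n} \chi_{J,\alpha}^2\right) = P\left(\frac{\sigma^2}{n} \chi_J^2 > \frac{\sigma^2}{n} \chi_{J,\alpha}^2\right) = \alpha.$$

Also,

$$|\hat{r}(x) - r_J(x)| \leq \sum_{j=1}^{J} |\phi_j(x)|\,|\hat{\beta}_j - \beta_j| \leq \sqrt{\sum_{j=1}^{J} \phi_j^2(x)}\,\sqrt{\sum_{j=1}^{J} (\hat{\beta}_j - \beta_j)^2} = a(x)\sqrt{L}$$

by the Cauchy-Schwarz inequality (Theorem 4.8). So,

$$P\left(\max_x \frac{|\hat{r}(x) - r_J(x)|}{a(x)} > \frac{\sigma \chi_{J,\alpha}}{\sqrt{n}}\right) \leq P\left(\sqrt{L} > \frac{\sigma \chi_{J,\alpha}}{\sqrt{n}}\right) = \alpha$$

and the result follows. ■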

21.12 Example. Figure 21.5 shows the confidence envelope for the doppler
signal. The first plot is based on $J = 234$ (the value of $J$ that minimizes the
estimated risk). The second is based on $J = 45$. Larger $J$ yields a higher
resolution estimator at the cost of large confidence bands. Smaller $J$ yields a
lower resolution estimator but has tighter confidence bands. ■

So far, we have assumed that the $x_i$'s are of the form $\{1/n, 2/n, \ldots, 1\}$.
If the $x_i$'s are on an interval $[a, b]$, then we can rescale them so that they are in the
interval $[0, 1]$. If the $x_i$'s are not equally spaced, the methods we have discussed
still apply so long as the $x_i$'s "fill out" the interval $[0, 1]$ in such a way so as to
not be too clumped together. If we want to treat the $x_i$'s as random instead
of fixed, then the method needs significant modifications which we shall not
deal with here.

FIGURE 21.5. Estimates and confidence bands for the doppler test function using
$n = 2{,}048$ observations. First plot: $J = 234$ terms. Second plot: $J = 45$ terms.



21.4 Wavelets 

Suppose there is a sharp jump in a regression function $f$ at some point $x$
but that $f$ is otherwise very smooth. Such a function $f$ is said to be spatially
inhomogeneous. The doppler function is an example of a spatially
inhomogeneous function; it is smooth for large $x$ and unsmooth for small $x$.

It is hard to estimate $f$ using the methods we have discussed so far. If we
use a cosine basis and only keep low order terms, we will miss the peak; if
we allow higher order terms we will find the peak but we will make the rest
of the curve very wiggly. Similar comments apply to kernel regression. If we
use a large bandwidth, then we will smooth out the peak; if we use a small
bandwidth, then we will find the peak but we will make the rest of the curve
very wiggly.

One way to estimate inhomogeneous functions is to use a more carefully
chosen basis that allows us to place a "blip" in some small region without
adding wiggles elsewhere. In this section, we describe a special class of bases
called wavelets, that are aimed at fixing this problem. Statistical inference
using wavelets is a large and active area. We will just discuss a few of the
main ideas to get a flavor of this approach.

We start with a particular wavelet called the Haar wavelet. The Haar
father wavelet or Haar scaling function is defined by

$$\phi(x) = \begin{cases} 1 & \text{if } 0 \leq x < 1 \\ 0 & \text{otherwise.} \end{cases} \tag{21.30}$$

The mother Haar wavelet is defined by

$$\psi(x) = \begin{cases} -1 & \text{if } 0 \leq x \leq \frac{1}{2} \\ 1 & \text{if } \frac{1}{2} < x \leq 1. \end{cases} \tag{21.31}$$

For any integers $j$ and $k$ define

$$\psi_{j,k}(x) = 2^{j/2}\,\psi(2^j x - k). \tag{21.32}$$

The function $\psi_{j,k}$ has the same shape as $\psi$ but it has been rescaled by a factor
of $2^{j/2}$ and shifted by a factor of $k$.

See Figure 21.6 for some examples of Haar wavelets. Notice that for large
$j$, $\psi_{j,k}$ is a very localized function. This makes it possible to add a blip to a
function in one place without adding wiggles elsewhere. Increasing $j$ is like
looking in a microscope at increasing degrees of resolution. In technical terms,
we say that wavelets provide a multiresolution analysis of $L_2(0, 1)$.

FIGURE 21.6. Some Haar wavelets. Left: the mother wavelet $\psi(x)$. Right: $\psi_{2,2}(x)$.

Let

$$W_j = \{\psi_{jk},\ k = 0, 1, \ldots, 2^j - 1\}$$

be the set of rescaled and shifted mother wavelets at resolution $j$.

21.13 Theorem. The set of functions

$$\{\phi, W_0, W_1, W_2, \ldots\}$$

is an orthonormal basis for $L_2(0, 1)$.

It follows from this theorem that we can expand any function $f \in L_2(0, 1)$ in
this basis. Because each $W_j$ is itself a set of functions, we write the expansion
as a double sum:

$$f(x) = \alpha\phi(x) + \sum_{j=0}^{\infty} \sum_{k=0}^{2^j - 1} \beta_{j,k}\,\psi_{j,k}(x) \tag{21.33}$$

where

$$\alpha = \int_0^1 f(x)\phi(x)\,dx, \qquad \beta_{j,k} = \int_0^1 f(x)\psi_{j,k}(x)\,dx.$$

We call $\alpha$ the scaling coefficient and the $\beta_{j,k}$'s are called the detail
coefficients. We call the finite sum

$$f_J(x) = \alpha\phi(x) + \sum_{j=0}^{J-1} \sum_{k=0}^{2^j - 1} \beta_{j,k}\,\psi_{j,k}(x) \tag{21.34}$$

the resolution $J$ approximation to $f$. The total number of terms in this sum
is

$$1 + \sum_{j=0}^{J-1} 2^j = 1 + 2^J - 1 = 2^J.$$

21.14 Example. Figure 21.7 shows the doppler signal, and its reconstruction
using $J = 3$, $J = 5$, and $J = 8$. ■

Haar wavelets are localized, meaning that they are zero outside an interval.
But they are not smooth. This raises the question of whether there exist
smooth, localized wavelets that form an orthonormal basis. In 1988, Ingrid
Daubechies showed that such wavelets do exist. These smooth wavelets are
difficult to describe. They can be constructed numerically but there is no
closed form formula for the smoother wavelets. To keep things simple, we will
continue to use Haar wavelets.

Consider the regression model $Y_i = r(x_i) + \sigma\epsilon_i$ where $\epsilon_i \sim N(0, 1)$ and
$x_i = i/n$. To simplify the discussion we assume that $n = 2^J$ for some $J$.

There is one major difference between estimation using wavelets and estimation
using a cosine (or polynomial) basis. With the cosine basis, we used all the terms
$1 \leq j \leq J$ for some $J$. The number of terms $J$ acted as a smoothing parameter.
With wavelets, we control smoothing using a method called thresholding
where we keep a term in the function approximation if its coefficient is large;
otherwise, we throw out that term. There are many versions of thresholding.
The simplest is called hard, universal thresholding. Let $J = \log_2(n)$ and define

$$\hat{\alpha} = \frac{1}{n} \sum_i \phi(x_i) Y_i \quad \text{and} \quad D_{j,k} = \frac{1}{n} \sum_i \psi_{j,k}(x_i) Y_i \tag{21.35}$$

for $0 \leq j \leq J - 1$.

FIGURE 21.7. The doppler signal and its reconstruction
$f_J(x) = \alpha\phi(x) + \sum_{j=0}^{J-1} \sum_k \beta_{j,k}\psi_{j,k}(x)$ based on $J = 3$, $J = 5$, and $J = 8$.



Haar Wavelet Regression

1. Compute $\hat{\alpha}$ and $D_{j,k}$ as in (21.35), for $0 \leq j \leq J - 1$.

2. Estimate $\sigma$; see (21.37).

3. Apply universal thresholding:

$$\hat{\beta}_{j,k} = \begin{cases} D_{j,k} & \text{if } |D_{j,k}| > \hat{\sigma}\sqrt{\dfrac{2\log n}{n}} \\ 0 & \text{otherwise.} \end{cases} \tag{21.36}$$

4. Set $\hat{f}(x) = \hat{\alpha}\phi(x) + \sum_{j=0}^{J-1} \sum_{k=0}^{2^j - 1} \hat{\beta}_{j,k}\psi_{j,k}(x)$.

In practice, we do not compute $\hat{\alpha}$ and $D_{j,k}$ using (21.35). Instead, we use
the discrete wavelet transform (DWT) which is very fast. The DWT for
Haar wavelets is described in the appendix. The estimate of $\sigma$ is

$$\hat{\sigma} = \sqrt{n} \times \frac{\text{median}\left(|D_{J-1,k}| : k = 0, \ldots, 2^{J-1} - 1\right)}{0.6745}. \tag{21.37}$$

The estimate for $\sigma$ may look strange. It is similar to the estimate we used
for the cosine basis but it is designed to be insensitive to sharp peaks in the
function.

To understand the intuition behind universal thresholding, consider what
happens when there is no signal, that is, when $\beta_{j,k} = 0$ for all $j$ and $k$.

21.15 Theorem. Suppose that $\beta_{j,k} = 0$ for all $j$ and $k$ and let $\hat{\beta}_{j,k}$ be the
universal threshold estimator. Then

$$P(\hat{\beta}_{j,k} = 0 \text{ for all } j, k) \to 1$$

as $n \to \infty$.

Proof. To simplify the proof, assume that $\sigma$ is known. Now $D_{j,k} \approx
N(0, \sigma^2/n)$. We will need Mill's inequality (Theorem 4.7): if $Z \sim N(0, 1)$
then $P(|Z| > t) \leq (c/t)e^{-t^2/2}$ where $c = \sqrt{2/\pi}$ is a constant. Thus, with
$\lambda = \sigma\sqrt{2\log n/n}$,

$$P(\max |D_{j,k}| > \lambda) \leq \sum_{j,k} P(|D_{j,k}| > \lambda) = \sum_{j,k} P\left(\frac{\sqrt{n}|D_{j,k}|}{\sigma} > \frac{\sqrt{n}\lambda}{\sigma}\right) \leq \sum_{j,k} \frac{c\sigma}{\lambda\sqrt{n}} \exp\left\{-\frac{1}{2}\frac{n\lambda^2}{\sigma^2}\right\} \leq \frac{c}{\sqrt{2\log n}} \to 0. \quad ■$$

21.16 Example. Consider $Y_i = r(x_i) + \sigma\epsilon_i$ where $r$ is the doppler signal,
$\sigma = .1$ and $n = 2{,}048$. Figure 21.8 shows the data and the estimated function
using universal thresholding. Of course, the estimate is not smooth since Haar
wavelets are not smooth. Nonetheless, the estimate is quite accurate. ■

FIGURE 21.8. Estimate of the Doppler function using Haar wavelets and universal
thresholding.

21.5 Appendix 

The DWT for Haar Wavelets. Let $y$ be the vector of $Y_i$'s (length $n$) and
let $J = \log_2(n)$. Create a list D with elements

    D[[0]], ..., D[[J-1]].

Set:

    temp <- y/sqrt(n)

Then do:

    for(j in (J-1):0){
        m <- 2^j
        I <- (1:m)
        D[[j]] <- (temp[2*I] - temp[(2*I)-1])/sqrt(2)
        temp   <- (temp[2*I] + temp[(2*I)-1])/sqrt(2)
    }
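
For readers who prefer something executable, here is a hedged Python translation of the pseudocode above, combined with the thresholding steps of the Haar wavelet regression box. The inverse transform `haar_idwt`, the simulated doppler data, and all function names are additions made for this sketch and do not come from the text.

```python
import numpy as np

def haar_dwt(y):
    """Haar DWT following the appendix; returns alpha_hat and the list D[0..J-1]."""
    n = len(y)
    J = int(np.log2(n))
    assert 2 ** J == n, "n must be a power of 2"
    temp = np.asarray(y, dtype=float) / np.sqrt(n)
    D = [None] * J
    for j in range(J - 1, -1, -1):
        even, odd = temp[1::2], temp[0::2]   # temp[2i] and temp[2i-1] in 1-based indexing
        D[j] = (even - odd) / np.sqrt(2.0)
        temp = (even + odd) / np.sqrt(2.0)
    return temp[0], D

def haar_idwt(alpha, D):
    """Inverse of haar_dwt (added for this sketch): rebuilds the length-n vector."""
    temp = np.array([alpha], dtype=float)
    for d in D:                              # levels j = 0, ..., J-1
        out = np.empty(2 * len(temp))
        out[0::2] = (temp - d) / np.sqrt(2.0)
        out[1::2] = (temp + d) / np.sqrt(2.0)
        temp = out
    return temp * np.sqrt(len(temp))

def haar_regression(y):
    """Hard universal thresholding of the Haar coefficients, (21.36) and (21.37)."""
    n = len(y)
    alpha, D = haar_dwt(y)
    sigma_hat = np.sqrt(n) * np.median(np.abs(D[-1])) / 0.6745
    threshold = sigma_hat * np.sqrt(2.0 * np.log(n) / n)
    D_thr = [np.where(np.abs(d) > threshold, d, 0.0) for d in D]
    return haar_idwt(alpha, D_thr), sigma_hat

def doppler(x):
    return np.sqrt(x * (1.0 - x)) * np.sin(2.1 * np.pi / (x + 0.05))

# Simulated data in the spirit of Example 21.16.
rng = np.random.default_rng(5)
n = 2048
x = np.arange(1, n + 1) / n
truth = doppler(x)
y = truth + rng.normal(0.0, 0.1, n)

fit, sigma_hat = haar_regression(y)
mse = np.mean((fit - truth) ** 2)
```

Without thresholding, `haar_idwt(*haar_dwt(y))` returns `y` itself, which is a convenient sanity check on the transform pair.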
21.6 Bibliographic Remarks

Efromovich (1999) is a reference for orthogonal function methods. See also
Beran (2000) and Beran and Dümbgen (1998). An introduction to wavelets is
given in Ogden (1997). A more advanced treatment can be found in Härdle
et al. (1998). The theory of statistical estimation using wavelets has been
developed by many authors, especially David Donoho and Ian Johnstone. See
Donoho and Johnstone (1994), Donoho and Johnstone (1995), Donoho et al.
(1995), and Donoho and Johnstone (1998).

21.7 Exercises 

1. Prove Theorem 21.5. 

2. Prove Theorem 21.9. 

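3. Let

$$\psi_1 = \left(\frac{1}{\sqrt{3}}, \frac{1}{\sqrt{3}}, \frac{1}{\sqrt{3}}\right), \quad \psi_2 = \left(\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}, 0\right), \quad \psi_3 = \left(\frac{1}{\sqrt{6}}, \frac{1}{\sqrt{6}}, -\frac{2}{\sqrt{6}}\right).$$

Show that these vectors have norm 1 and are orthogonal.

4. Prove Parseval's relation, equation (21.6).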

5. Plot the first five Legendre polynomials. Verify, numerically, that they 
are orthonormal. 

6. Expand the following functions in the cosine basis on $[0, 1]$. For (a)
and (b), find the coefficients $\beta_j$ analytically. For (c) and (d), find the
coefficients $\beta_j$ numerically, i.e.

$$\beta_j = \int_0^1 f(x)\phi_j(x)\,dx \approx \frac{1}{N} \sum_{r=1}^{N} f\left(\frac{r}{N}\right)\phi_j\left(\frac{r}{N}\right)$$

for some large integer $N$. Then plot the partial sum $\sum_{j=1}^{n} \beta_j \phi_j(x)$ for
increasing values of $n$.

(a) $f(x) = \sqrt{2}\cos(3\pi x)$.

(b) $f(x) = \sin(\pi x)$.

(c) $f(x) = \sum_{j=1}^{11} h_j K(x - t_j)$ where $K(t) = (1 + \text{sign}(t))/2$, $\text{sign}(x) = -1$
if $x < 0$, $\text{sign}(x) = 0$ if $x = 0$, $\text{sign}(x) = 1$ if $x > 0$,

$(t_j) = (.1, .13, .15, .23, .25, .40, .44, .65, .76, .78, .81)$,

$(h_j) = (4, -5, 3, -4, 5, -4.2, 2.1, 4.3, -3.1, 2.1, -4.2)$.

(d) $f = \sqrt{x(1 - x)}\,\sin\left(\frac{2.1\pi}{x + .05}\right)$.

7. Consider the glass fragments data from the book's website. Let $Y$ be
refractive index and let $X$ be aluminum content (the fourth variable).

(a) Do a nonparametric regression to fit the model $Y = f(x) + \epsilon$ using
the cosine basis method. The data are not on a regular grid. Ignore this
when estimating the function. (But do sort the data first according to
$x$.) Provide a function estimate, an estimate of the risk, and a confidence
band.

(b) Use the wavelet method to estimate $f$.

8. Show that the Haar wavelets are orthonormal. 

9. Consider again the doppler signal:

$$f(x) = \sqrt{x(1 - x)}\,\sin\left(\frac{2.1\pi}{x + 0.05}\right).$$

Let $n = 1{,}024$, $\sigma = 0.1$, and let $(x_1, \ldots, x_n) = (1/n, \ldots, 1)$. Generate
data

$$Y_i = f(x_i) + \sigma\epsilon_i$$

where $\epsilon_i \sim N(0, 1)$.

(a) Fit the curve using the cosine basis method. Plot the function estimate
and confidence band for $J = 10, 20, \ldots, 100$.

(b) Use Haar wavelets to fit the curve.

10. (Haar density estimation.) Let $X_1, \ldots, X_n \sim f$ for some density $f$ on
$[0, 1]$. Let's consider constructing a wavelet histogram. Let $\phi$ and $\psi$ be
the Haar father and mother wavelet. Write

$$f(x) \approx \phi(x) + \sum_{j=0}^{J-1} \sum_{k=0}^{2^j - 1} \beta_{j,k}\,\psi_{j,k}(x)$$

where $J \approx \log_2(n)$. Let

$$\hat{\beta}_{j,k} = \frac{1}{n} \sum_{i=1}^{n} \psi_{j,k}(X_i).$$

(a) Show that $\hat{\beta}_{j,k}$ is an unbiased estimate of $\beta_{j,k}$.

(b) Define the Haar histogram

$$\hat{f}(x) = \phi(x) + \sum_{j=0}^{B} \sum_{k=0}^{2^j - 1} \hat{\beta}_{j,k}\,\psi_{j,k}(x)$$

for $0 \leq B \leq J - 1$.

(c) Find an approximate expression for the MSE as a function of $B$.

(d) Generate $n = 1{,}000$ observations from a Beta(15, 4) density. Estimate
the density using the Haar histogram. Use leave-one-out cross
validation to choose $B$.



11. In this question, we will explore the motivation for equation (21.37). Let
$X_1, \ldots, X_n \sim N(0, \sigma^2)$. Let

$$\hat{\sigma} = \sqrt{n} \times \frac{\text{median}(|X_1|, \ldots, |X_n|)}{0.6745}.$$

(a) Show that $E(\hat{\sigma}) = \sigma$.

(b) Simulate $n = 100$ observations from a N(0,1) distribution. Compute
$\hat{\sigma}$ as well as the usual estimate of $\sigma$. Repeat 1,000 times and compare
the MSE.

(c) Repeat (b) but add some outliers to the data. To do this, simulate
each observation from a N(0,1) with probability .95 and simulate each
observation from a N(0,10) with probability .05.

12. Repeat question 6 using the Haar basis. 




22 

Classification 



22.1 Introduction 

The problem of predicting a discrete random variable Y from another random 
variable X is called classification, supervised learning, discrimination, 
or pattern recognition. 

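Consider IID data $(X_1, Y_1), \ldots, (X_n, Y_n)$ where

$$X_i = (X_{i1}, \ldots, X_{id}) \in \mathcal{X} \subset \mathbb{R}^d$$

is a $d$-dimensional vector and $Y_i$ takes values in some finite set $\mathcal{Y}$. A
classification rule is a function $h : \mathcal{X} \to \mathcal{Y}$. When we observe a new $X$, we predict
$Y$ to be $h(X)$.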

22.1 Example. Here is an example with fake data. Figure 22.1 shows 100
data points. The covariate $X = (X_1, X_2)$ is 2-dimensional and the outcome
$Y \in \mathcal{Y} = \{0, 1\}$. The $Y$ values are indicated on the plot with the triangles
representing $Y = 1$ and the squares representing $Y = 0$. Also shown is a linear
classification rule represented by the solid line. This is a rule of the form

$$h(x) = \begin{cases} 1 & \text{if } a + b_1 x_1 + b_2 x_2 > 0 \\ 0 & \text{otherwise.} \end{cases}$$

Everything above the line is classified as a 0 and everything below the line is
classified as a 1. ■

FIGURE 22.1. Two covariates and a linear decision boundary. △ means $Y = 1$.
□ means $Y = 0$. These two groups are perfectly separated by the linear decision
boundary; you probably won't see real data like this.



22.2 Example. Recall the Coronary Risk-Factor Study (CORIS) data
from Example 13.17. There are 462 males between the ages of 15 and 64 from
three rural areas in South Africa. The outcome Y is the presence (Y = 1) or
absence (Y = 0) of coronary heart disease and there are 9 covariates: systolic
blood pressure, cumulative tobacco (kg), ldl (low density lipoprotein cholesterol),
adiposity, famhist (family history of heart disease), typea (type-A behavior),
obesity, alcohol (current alcohol consumption), and age. I computed
a linear decision boundary using the LDA method based on two of the
covariates, systolic blood pressure and tobacco consumption. The LDA method
will be explained shortly. In this example, the groups are hard to tell apart.
In fact, 141 of the 462 subjects are misclassified using this classification rule.



At this point, it is worth revisiting the Statistics/Data Mining dictionary: 



Statistics        Computer Science       Meaning
classification    supervised learning    predicting a discrete Y from X
data              training sample        $(X_1, Y_1), \ldots, (X_n, Y_n)$
covariates        features               the $X_i$'s
classifier        hypothesis             map $h : \mathcal{X} \to \mathcal{Y}$
estimation        learning               finding a good classifier



22.2 Error Rates and the Bayes Classifier 

Our goal is to find a classification rule h that makes accurate predictions. We
start with the following definitions:

22.3 Definition. The true error rate¹ of a classifier h is

$$L(h) = P(\{h(X) \ne Y\}) \qquad (22.1)$$

and the empirical error rate or training error rate is

$$\hat L_n(h) = \frac{1}{n} \sum_{i=1}^{n} I(h(X_i) \ne Y_i). \qquad (22.2)$$
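As a small illustration of the training error rate (22.2), the sketch below evaluates a linear rule of the kind used in Example 22.1. The function names, the coefficients, and the four data points are hypothetical and chosen only for illustration.

```python
def linear_rule(a, b1, b2):
    """Return the rule h(x) = 1 if a + b1*x1 + b2*x2 > 0, else 0."""
    def h(x):
        return 1 if a + b1 * x[0] + b2 * x[1] > 0 else 0
    return h


def training_error(h, xs, ys):
    """Empirical error rate (22.2): fraction of points with h(X_i) != Y_i."""
    return sum(1 for x, y in zip(xs, ys) if h(x) != y) / len(xs)


# Hypothetical toy data: four points, one of which the rule gets wrong.
xs = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5), (0.2, 0.1)]
ys = [0, 1, 1, 1]
h = linear_rule(-1.0, 1.0, 1.0)   # classify as 1 when x1 + x2 > 1
err = training_error(h, xs, ys)   # one mistake out of four points
```

The true error rate L(h) cannot be computed this way because it is a probability under the unknown distribution of (X, Y); the code only evaluates its empirical counterpart.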

First we consider the special case where $\mathcal{Y} = \{0, 1\}$. Let

$$r(x) = E(Y \mid X = x) = P(Y = 1 \mid X = x)$$

denote the regression function. From Bayes' theorem we have that

$$r(x) = P(Y = 1 \mid X = x) = \frac{f(x \mid Y = 1)\,P(Y = 1)}{f(x \mid Y = 1)\,P(Y = 1) + f(x \mid Y = 0)\,P(Y = 0)} = \frac{\pi f_1(x)}{\pi f_1(x) + (1 - \pi) f_0(x)} \qquad (22.3)$$

where

$$f_0(x) = f(x \mid Y = 0), \qquad f_1(x) = f(x \mid Y = 1), \qquad \pi = P(Y = 1).$$



22.4 Definition. The Bayes classification rule $h^*$ is

$$h^*(x) = \begin{cases} 1 & \text{if } r(x) > \tfrac{1}{2} \\ 0 & \text{otherwise.} \end{cases} \qquad (22.4)$$

The set $\mathcal{D}(h) = \{x : P(Y = 1 \mid X = x) = P(Y = 0 \mid X = x)\}$ is called the
decision boundary.



Warning! The Bayes rule has nothing to do with Bayesian inference. We 
could estimate the Bayes rule using either frequentist or Bayesian methods. 
The Bayes rule may be written in several equivalent forms:

$$h^*(x) = \begin{cases} 1 & \text{if } P(Y = 1 \mid X = x) > P(Y = 0 \mid X = x) \\ 0 & \text{otherwise} \end{cases} \qquad (22.5)$$

and

$$h^*(x) = \begin{cases} 1 & \text{if } \pi f_1(x) > (1 - \pi) f_0(x) \\ 0 & \text{otherwise.} \end{cases} \qquad (22.6)$$

¹One can use other loss functions. For simplicity we will use the error rate as our loss function.

22.5 Theorem. The Bayes rule is optimal, that is, if h is any other classification
rule then $L(h^*) \le L(h)$.
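To make the Bayes rule concrete, here is a minimal sketch for a case where the class densities are known exactly. The setup is an assumption made for illustration only: one covariate, $f_0 = N(0, 1)$, $f_1 = N(2, 1)$ and $\pi = 1/2$. The helper names are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def regression_function(x, pi, f1, f0):
    """r(x) = P(Y = 1 | X = x), computed from Bayes' theorem."""
    num = pi * f1(x)
    return num / (num + (1.0 - pi) * f0(x))


def bayes_rule(x, pi, f1, f0):
    """h*(x) = 1 if r(x) > 1/2, else 0."""
    return 1 if regression_function(x, pi, f1, f0) > 0.5 else 0


# Assumed setup: f0 = N(0, 1), f1 = N(2, 1), pi = 1/2.
f0 = lambda x: normal_pdf(x, 0.0, 1.0)
f1 = lambda x: normal_pdf(x, 2.0, 1.0)
labels = [bayes_rule(x, 0.5, f1, f0) for x in (0.5, 1.5)]
```

By symmetry the decision boundary in this assumed setup is the single point x = 1, where r(x) = 1/2. In practice the densities and $\pi$ are unknown, which is why the rule has to be approximated from data.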

The Bayes rule depends on unknown quantities so we need to use the data 
to find some approximation to the Bayes rule. At the risk of oversimplifying, 
there are three main approaches: 

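1. Empirical Risk Minimization. Choose a set of classifiers $\mathcal{H}$ and find $\hat h \in \mathcal{H}$
that minimizes some estimate of L(h).

2. Regression. Find an estimate $\hat r$ of the regression function r and define

$$\hat h(x) = \begin{cases} 1 & \text{if } \hat r(x) > \tfrac{1}{2} \\ 0 & \text{otherwise.} \end{cases}$$

3. Density Estimation. Estimate $f_0$ from the $X_i$'s for which $Y_i = 0$, estimate
$f_1$ from the $X_i$'s for which $Y_i = 1$ and let $\hat\pi = n^{-1} \sum_{i=1}^{n} Y_i$. Define

$$\hat r(x) = \hat P(Y = 1 \mid X = x) = \frac{\hat\pi \hat f_1(x)}{\hat\pi \hat f_1(x) + (1 - \hat\pi) \hat f_0(x)}$$

and

$$\hat h(x) = \begin{cases} 1 & \text{if } \hat r(x) > \tfrac{1}{2} \\ 0 & \text{otherwise.} \end{cases}$$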

Now let us generalize to the case where Y takes on more than two values 
as follows. 



22.6 Theorem. Suppose that $Y \in \mathcal{Y} = \{1, \ldots, K\}$. The optimal rule is

$$h(x) = \mathrm{argmax}_k\, P(Y = k \mid X = x) = \mathrm{argmax}_k\, \pi_k f_k(x)$$

where

$$P(Y = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_r f_r(x)\,\pi_r}, \qquad (22.9)$$

$\pi_r = P(Y = r)$, $f_r(x) = f(x \mid Y = r)$ and $\mathrm{argmax}_k$ means "the value of k
that maximizes that expression."





22.3 Gaussian and Linear Classifiers 



Perhaps the simplest approach to classification is to use the density estimation
strategy and assume a parametric model for the densities. Suppose that
$\mathcal{Y} = \{0, 1\}$ and that $f_0(x) = f(x \mid Y = 0)$ and $f_1(x) = f(x \mid Y = 1)$ are both
multivariate Gaussians:

$$f_k(x) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right\}, \quad k = 0, 1.$$

Thus, $X \mid Y = 0 \sim N(\mu_0, \Sigma_0)$ and $X \mid Y = 1 \sim N(\mu_1, \Sigma_1)$.



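22.7 Theorem. If $X \mid Y = 0 \sim N(\mu_0, \Sigma_0)$ and $X \mid Y = 1 \sim N(\mu_1, \Sigma_1)$, then the
Bayes rule is

$$h^*(x) = \begin{cases} 1 & \text{if } r_1^2 < r_0^2 + 2 \log\left(\frac{\pi_1}{\pi_0}\right) + \log\left(\frac{|\Sigma_0|}{|\Sigma_1|}\right) \\ 0 & \text{otherwise} \end{cases} \qquad (22.10)$$

where $\pi_1 = \pi$, $\pi_0 = 1 - \pi$ and

$$r_i^2 = (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i), \quad i = 0, 1 \qquad (22.11)$$

is the Mahalanobis distance. An equivalent way of expressing the Bayes
rule is

$$h^*(x) = \mathrm{argmax}_k\, \delta_k(x)$$

where

$$\delta_k(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k \qquad (22.12)$$

and |A| denotes the determinant of a matrix A.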



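The decision boundary of the above classifier is quadratic so this procedure
is called quadratic discriminant analysis (QDA). In practice, we use
sample estimates of $\pi, \mu_0, \mu_1, \Sigma_0, \Sigma_1$ in place of the true values, namely:

$$\hat\pi_0 = \frac{1}{n} \sum_{i=1}^{n} (1 - Y_i), \qquad \hat\pi_1 = \frac{1}{n} \sum_{i=1}^{n} Y_i$$

$$\hat\mu_0 = \frac{1}{n_0} \sum_{i:\, Y_i = 0} X_i, \qquad \hat\mu_1 = \frac{1}{n_1} \sum_{i:\, Y_i = 1} X_i$$

$$S_0 = \frac{1}{n_0} \sum_{i:\, Y_i = 0} (X_i - \hat\mu_0)(X_i - \hat\mu_0)^T, \qquad S_1 = \frac{1}{n_1} \sum_{i:\, Y_i = 1} (X_i - \hat\mu_1)(X_i - \hat\mu_1)^T$$

where $n_0 = \sum_i (1 - Y_i)$ and $n_1 = \sum_i Y_i$.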
A simplification occurs if we assume that $\Sigma_0 = \Sigma_1 = \Sigma$. In that case, the
Bayes rule is

$$h^*(x) = \mathrm{argmax}_k\, \delta_k(x) \qquad (22.13)$$

where now

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k. \qquad (22.14)$$

The parameters are estimated as before, except that the mle of $\Sigma$ is

$$S = \frac{n_0 S_0 + n_1 S_1}{n_0 + n_1}.$$

The classification rule is

$$h^*(x) = \begin{cases} 1 & \text{if } \delta_1(x) > \delta_0(x) \\ 0 & \text{otherwise} \end{cases} \qquad (22.15)$$

where

$$\delta_j(x) = x^T S^{-1} \hat\mu_j - \frac{1}{2} \hat\mu_j^T S^{-1} \hat\mu_j + \log \hat\pi_j$$

is called the discriminant function. The decision boundary $\{x : \delta_0(x) = \delta_1(x)\}$
is linear so this method is called linear discrimination analysis
(LDA).
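The plug-in LDA rule can be sketched in a few lines. The code below is an illustration under stated assumptions, not a reference implementation: it uses only the Python standard library, the function names and the eight toy points are hypothetical, and the small Gaussian-elimination solver stands in for whatever linear algebra routine one would normally use.

```python
import math


def mean(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def scatter(rows, mu):
    """Sum over the rows of (x - mu)(x - mu)^T, i.e. n_k * S_k."""
    d = len(mu)
    s = [[0.0] * d for _ in range(d)]
    for r in rows:
        for a in range(d):
            for b in range(d):
                s[a][b] += (r[a] - mu[a]) * (r[b] - mu[b])
    return s


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (m[i][n] - sum(m[i][c] * z[c] for c in range(i + 1, n))) / m[i][i]
    return z


def fit_lda(xs, ys):
    """Return a function x -> [delta_0(x), delta_1(x)] from plug-in estimates."""
    n = len(xs)
    groups = [[x for x, y in zip(xs, ys) if y == k] for k in (0, 1)]
    mus = [mean(g) for g in groups]
    d = len(mus[0])
    s0, s1 = scatter(groups[0], mus[0]), scatter(groups[1], mus[1])
    # Pooled mle S = (n0 S0 + n1 S1) / (n0 + n1).
    pooled = [[(s0[a][b] + s1[a][b]) / n for b in range(d)] for a in range(d)]
    params = []
    for k in (0, 1):
        w = solve(pooled, mus[k])                      # S^{-1} mu_k
        const = -0.5 * sum(wi * mi for wi, mi in zip(w, mus[k]))
        params.append((w, const + math.log(len(groups[k]) / n)))
    return lambda x: [sum(wi * xi for wi, xi in zip(w, x)) + c for w, c in params]


def classify(delta, x):
    d0, d1 = delta(x)
    return 1 if d1 > d0 else 0


# Hypothetical toy data: two well-separated square clusters.
xs = [(0, 0), (1, 0), (0, 1), (1, 1), (3, 3), (4, 3), (3, 4), (4, 4)]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
delta = fit_lda(xs, ys)
```

For these toy points the two groups have equal sizes and equal scatter, so the fitted boundary is the line $x_1 + x_2 = 4$, halfway between the group means.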



22.8 Example. Let us return to the South African heart disease data. The
decision rule in Example 22.2 was obtained by linear discrimination. The
outcome was

          classified as 0   classified as 1
y = 0          277                25
y = 1          116                44

The observed misclassification rate is 141/462 = .31. Including all the covariates
reduces the error rate to .27. The results from quadratic discrimination
are

          classified as 0   classified as 1
y = 0          272                30
y = 1          113                47

which has about the same error rate 143/462 = .31. Including all the covariates
reduces the error rate to .26. In this example, there is little advantage to QDA
over LDA. ■



Now we generalize to the case where Y takes on more than two values. 
22.9 Theorem. Suppose that $Y \in \{1, \ldots, K\}$. If $f_k(x) = f(x \mid Y = k)$ is
Gaussian, the Bayes rule is

$$h(x) = \mathrm{argmax}_k\, \delta_k(x)$$

where

$$\delta_k(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k. \qquad (22.16)$$

If the variances of the Gaussians are equal, then

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k. \qquad (22.17)$$



We estimate $\delta_k(x)$ by inserting estimates of $\mu_k$, $\Sigma_k$ and $\pi_k$. There is
another version of linear discriminant analysis due to Fisher. The idea is
to first reduce the dimension of covariates to one dimension by projecting
the data onto a line. Algebraically, this means replacing the covariate
$X = (X_1, \ldots, X_d)$ with a linear combination $U = w^T X = \sum_{j=1}^{d} w_j X_j$. The goal is
to choose the vector $w = (w_1, \ldots, w_d)$ that "best separates the data." Then
we perform classification with the one-dimensional covariate U instead of X.

We need to define what we mean by separation of the groups. We would like
the two groups to have means that are far apart relative to their spread. Let
$\mu_j$ denote the mean of X for Y = j and let $\Sigma$ be the variance matrix of X. Then
$E(U \mid Y = j) = E(w^T X \mid Y = j) = w^T \mu_j$ and $V(U) = w^T \Sigma w$. Define the
separation by

$$J(w) = \frac{\big(E(U \mid Y = 0) - E(U \mid Y = 1)\big)^2}{w^T \Sigma w} = \frac{(w^T \mu_0 - w^T \mu_1)^2}{w^T \Sigma w} = \frac{w^T (\mu_0 - \mu_1)(\mu_0 - \mu_1)^T w}{w^T \Sigma w}.$$

We estimate J as follows. Let $n_j = \sum_{i=1}^{n} I(Y_i = j)$ be the number of
observations in group j, let $\bar X_j$ be the sample mean vector of the X's for group j,
and let $S_j$ be the sample covariance matrix in group j. Define

$$\hat J(w) = \frac{w^T S_B w}{w^T S_W w} \qquad (22.18)$$

where

$$S_B = (\bar X_0 - \bar X_1)(\bar X_0 - \bar X_1)^T, \qquad S_W = \frac{(n_0 - 1) S_0 + (n_1 - 1) S_1}{(n_0 - 1) + (n_1 - 1)}.$$

22.10 Theorem. The vector

$$w = S_W^{-1} (\bar X_0 - \bar X_1) \qquad (22.19)$$

is a maximizer of $\hat J(w)$. We call

$$U = w^T X = (\bar X_0 - \bar X_1)^T S_W^{-1} X \qquad (22.20)$$

the Fisher linear discriminant function. The midpoint m between the projected
group means $w^T \bar X_0$ and $w^T \bar X_1$ is

$$m = \frac{1}{2} w^T (\bar X_0 + \bar X_1) = \frac{1}{2} (\bar X_0 - \bar X_1)^T S_W^{-1} (\bar X_0 + \bar X_1). \qquad (22.21)$$

Fisher's classification rule is

$$h(x) = \begin{cases} 0 & \text{if } w^T x \ge m \\ 1 & \text{if } w^T x < m. \end{cases}$$

Fisher's rule is the same as the Bayes linear classifier in equation (22.14)
when $\hat\pi = 1/2$.



22.4 Linear Regression and Logistic Regression 



A more direct approach to classification is to estimate the regression function
$r(x) = E(Y \mid X = x)$ without bothering to estimate the densities $f_k$. For the
rest of this section, we will only consider the case where $\mathcal{Y} = \{0, 1\}$. Thus,
$r(x) = P(Y = 1 \mid X = x)$ and once we have an estimate $\hat r$, we will use the
classification rule

$$h(x) = \begin{cases} 1 & \text{if } \hat r(x) > \tfrac{1}{2} \\ 0 & \text{otherwise.} \end{cases} \qquad (22.22)$$

The simplest regression model is the linear regression model

$$Y = r(x) + \epsilon = \beta_0 + \sum_{j=1}^{d} \beta_j X_j + \epsilon \qquad (22.23)$$

where $E(\epsilon) = 0$. This model can't be correct since it does not force Y = 0 or
1. Nonetheless, it can sometimes lead to a decent classifier.
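Recall that the least squares estimate of $\beta = (\beta_0, \beta_1, \ldots, \beta_d)^T$ minimizes
the residual sums of squares

$$\mathrm{RSS}(\beta) = \sum_{i=1}^{n} \left( Y_i - \beta_0 - \sum_{j=1}^{d} X_{ij} \beta_j \right)^2.$$

Let $\mathbb{X}$ denote the $n \times (d + 1)$ matrix of the form

$$\mathbb{X} = \begin{pmatrix} 1 & X_{11} & \cdots & X_{1d} \\ 1 & X_{21} & \cdots & X_{2d} \\ \vdots & \vdots & & \vdots \\ 1 & X_{n1} & \cdots & X_{nd} \end{pmatrix}.$$

Also let $Y = (Y_1, \ldots, Y_n)^T$. Then,

$$\mathrm{RSS}(\beta) = (Y - \mathbb{X}\beta)^T (Y - \mathbb{X}\beta)$$

and the model can be written as

$$Y = \mathbb{X}\beta + \epsilon$$

where $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$. From Theorem 13.13,

$$\hat\beta = (\mathbb{X}^T \mathbb{X})^{-1} \mathbb{X}^T Y.$$

The predicted values are

$$\hat Y = \mathbb{X} \hat\beta.$$

Now we use (22.22) to classify, where $\hat r(x) = \hat\beta_0 + \sum_j \hat\beta_j x_j$.

An alternative is to use logistic regression which was also discussed in
Chapter 13. The model is

$$r(x) = P(Y = 1 \mid X = x) = \frac{e^{\beta_0 + \sum_j \beta_j x_j}}{1 + e^{\beta_0 + \sum_j \beta_j x_j}} \qquad (22.24)$$

and the mle $\hat\beta$ is obtained numerically.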
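One simple way to obtain the logistic mle numerically is gradient ascent on the conditional log-likelihood. The sketch below is only an illustration: the text does not say which numerical method is used, the step size and iteration count are arbitrary choices, and the six one-dimensional data points are hypothetical (they are deliberately not linearly separable, so that the mle is finite).

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_logistic(xs, ys, steps=5000, rate=0.1):
    """Approximate the logistic mle by gradient ascent.

    xs is a list of covariate tuples; an intercept is added automatically.
    The gradient of the log-likelihood is sum_i (y_i - r(x_i)) * (1, x_i).
    """
    d = len(xs[0])
    beta = [0.0] * (d + 1)
    for _ in range(steps):
        grad = [0.0] * (d + 1)
        for x, y in zip(xs, ys):
            row = (1.0,) + tuple(x)
            resid = y - sigmoid(sum(b * v for b, v in zip(beta, row)))
            for j, v in enumerate(row):
                grad[j] += resid * v
        beta = [b + rate * g / len(xs) for b, g in zip(beta, grad)]
    return beta


def predict(beta, x):
    """Classification rule (22.22) applied to the fitted logistic model."""
    r_hat = sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))
    return 1 if r_hat > 0.5 else 0


# Hypothetical toy data with overlapping classes.
xs = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
ys = [0, 0, 1, 0, 1, 1]
beta = fit_logistic(xs, ys)
```

Standard software typically uses Newton-type iterations (reweighted least squares) instead, which converge in far fewer steps; gradient ascent is shown here only because it is short.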

22.11 Example. Let us return to the heart disease data. The mle is given in
Example 13.17. The error rate, using this model for classification, is .27. The
error rate from a linear regression is .26.

We can get a better classifier by fitting a richer model. For example, we
could fit

$$\mathrm{logit}\, P(Y = 1 \mid X = x) = \beta_0 + \sum_j \beta_j x_j + \sum_{j,k} \beta_{jk} x_j x_k. \qquad (22.25)$$

More generally, we could add terms of up to order r for some integer r. Large
values of r give a more complicated model which should fit the data better.
But there is a bias-variance tradeoff which we'll discuss later.

22.12 Example. If we use model (22.25) for the heart disease data with r = 2,
the error rate is reduced to .22. ■



22.5 Relationship Between Logistic Regression and 
LDA 



LDA and logistic regression are almost the same thing. If we assume that each
group is Gaussian with the same covariance matrix, then we saw earlier that

$$\log\left( \frac{P(Y = 1 \mid X = x)}{P(Y = 0 \mid X = x)} \right) = \log\left( \frac{\pi_1}{\pi_0} \right) - \frac{1}{2} (\mu_0 + \mu_1)^T \Sigma^{-1} (\mu_1 - \mu_0) + x^T \Sigma^{-1} (\mu_1 - \mu_0) \equiv \alpha_0 + \alpha^T x.$$

On the other hand, the logistic model is, by assumption,

$$\log\left( \frac{P(Y = 1 \mid X = x)}{P(Y = 0 \mid X = x)} \right) = \beta_0 + \beta^T x.$$

These are the same model since they both lead to classification rules that are
linear in x. The difference is in how we estimate the parameters.

The joint density of a single observation is $f(x, y) = f(x \mid y) f(y) = f(y \mid x) f(x)$.
In LDA we estimated the whole joint distribution by maximizing the likelihood

$$\prod_i f(x_i, y_i) = \underbrace{\prod_i f(x_i \mid y_i)}_{\text{Gaussian}}\; \underbrace{\prod_i f(y_i)}_{\text{Bernoulli}}. \qquad (22.26)$$

In logistic regression we maximized the conditional likelihood $\prod_i f(y_i \mid x_i)$ but
we ignored the second term $f(x_i)$:

$$\prod_i f(x_i, y_i) = \underbrace{\prod_i f(y_i \mid x_i)}_{\text{logistic}}\; \underbrace{\prod_i f(x_i)}_{\text{ignored}}. \qquad (22.27)$$

Since classification only requires knowing $f(y \mid x)$, we don't really need to
estimate the whole joint distribution. Logistic regression leaves the marginal
distribution f(x) unspecified so it is more nonparametric than LDA. This is
an advantage of the logistic regression approach over LDA.

To summarize: LDA and logistic regression both lead to a linear
classification rule. In LDA we estimate the entire joint distribution
$f(x, y) = f(x \mid y) f(y)$. In logistic regression we only estimate $f(y \mid x)$ and we
don't bother estimating f(x).



22.6 Density Estimation and Naive Bayes 

The Bayes rule is $h(x) = \mathrm{argmax}_k\, \pi_k f_k(x)$. If we can estimate $\pi_k$ and $f_k$
then we can estimate the Bayes classification rule. Estimating $\pi_k$ is easy but
what about $f_k$? We did this previously by assuming $f_k$ was Gaussian. Another
strategy is to estimate $f_k$ with some nonparametric density estimator
$\hat f_k$ such as a kernel estimator. But if $x = (x_1, \ldots, x_d)$ is high-dimensional,
nonparametric density estimation is not very reliable. This problem is ameliorated
if we assume that $X_1, \ldots, X_d$ are independent, for then
$f_k(x_1, \ldots, x_d) = \prod_{j=1}^{d} f_{kj}(x_j)$. This reduces the problem to d one-dimensional
density estimation problems, within each of the K groups. The resulting classifier
is called the naive Bayes classifier. The assumption that the components of X are
independent is usually wrong yet the resulting classifier might still be accurate.
Here is a summary of the steps in the naive Bayes classifier:

The Naive Bayes Classifier

1. For each group k, compute an estimate $\hat f_{kj}$ of the density $f_{kj}$ for
$X_j$, using the data for which $Y_i = k$.

2. Let

$$\hat f_k(x) = \hat f_k(x_1, \ldots, x_d) = \prod_{j=1}^{d} \hat f_{kj}(x_j).$$

3. Let

$$\hat\pi_k = \frac{1}{n} \sum_{i=1}^{n} I(Y_i = k)$$

where $I(Y_i = k) = 1$ if $Y_i = k$ and $I(Y_i = k) = 0$ if $Y_i \ne k$.

4. Let

$$h(x) = \mathrm{argmax}_k\, \hat\pi_k \hat f_k(x).$$

FIGURE 22.2. A simple classification tree.

The naive Bayes classifier is popular when x is high-dimensional and discrete.
In that case, $\hat f_{kj}(x_j)$ is especially simple.
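For discrete covariates, the four steps of the naive Bayes classifier reduce to counting. The following is a minimal sketch under that assumption: each $\hat f_{kj}$ is taken to be the raw relative frequency within group k (no smoothing, so unseen values get probability zero), and the function name and the toy weather-style data are hypothetical.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(xs, ys):
    """Estimate pi_k and the per-coordinate frequencies f_kj for discrete x."""
    n = len(xs)
    class_counts = Counter(ys)
    # freq[k][j][v] = number of group-k observations whose j-th coordinate is v
    freq = defaultdict(lambda: defaultdict(Counter))
    for x, y in zip(xs, ys):
        for j, v in enumerate(x):
            freq[y][j][v] += 1

    def score(k, x):
        s = class_counts[k] / n                       # pi_hat_k (step 3)
        for j, v in enumerate(x):
            s *= freq[k][j][v] / class_counts[k]      # f_hat_kj(x_j) (steps 1-2)
        return s

    # Step 4: argmax over k of pi_hat_k * f_hat_k(x).
    return lambda x: max(sorted(class_counts), key=lambda k: score(k, x))


# Hypothetical toy data with two discrete covariates.
xs = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
      ("rainy", "cool"), ("sunny", "hot"), ("rainy", "cool")]
ys = [1, 1, 0, 0, 1, 0]
h = fit_naive_bayes(xs, ys)
```

With raw frequencies, a covariate value never seen in a group zeroes out that group's score; adding a small constant to every count is a common remedy, but it is not part of the procedure described above.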



22.7 Trees 



Trees are classification methods that partition the covariate space $\mathcal{X}$ into
disjoint pieces and then classify the observations according to which partition
element they fall in. As the name implies, the classifier can be represented as
a tree.

For illustration, suppose there are two covariates, $X_1$ = age and $X_2$ = blood
pressure. Figure 22.2 shows a classification tree using these variables.

The tree is used in the following way. If a subject has Age ≥ 50 then we
classify him as Y = 1. If a subject has Age < 50 then we check his blood
pressure. If systolic blood pressure is < 100 then we classify him as Y = 1,
otherwise we classify him as Y = 0. Figure 22.3 shows the same classifier as
a partition of the covariate space.

FIGURE 22.3. Partition representation of classification tree.

Here is how a tree is constructed. First, suppose that $y \in \mathcal{Y} = \{0, 1\}$ and
that there is only a single covariate X. We choose a split point t that divides
the real line into two sets $A_1 = (-\infty, t]$ and $A_2 = (t, \infty)$. Let $\hat p_s(j)$ be the
proportion of observations in $A_s$ such that $Y_i = j$:

$$\hat p_s(j) = \frac{\sum_{i=1}^{n} I(Y_i = j,\, X_i \in A_s)}{\sum_{i=1}^{n} I(X_i \in A_s)} \qquad (22.28)$$

for s = 1, 2 and j = 0, 1. The impurity of the split t is defined to be

$$I(t) = \sum_{s=1}^{2} \gamma_s \qquad (22.29)$$

where

$$\gamma_s = 1 - \sum_{j=0}^{1} \hat p_s(j)^2. \qquad (22.30)$$

This particular measure of impurity is known as the Gini index. If a partition
element $A_s$ contains all 0's or all 1's, then $\gamma_s = 0$. Otherwise, $\gamma_s > 0$. We
choose the split point t to minimize the impurity. (Other indices of impurity
can be used besides the Gini index.)
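The split search for a single covariate can be sketched directly from the definition of the impurity as the sum of the Gini indices of the two partition elements. The function names and the six data points are hypothetical, and restricting the candidate split points to midpoints between consecutive observed values is an assumption made for convenience (any t between two adjacent observations gives the same partition).

```python
def gini(labels):
    """gamma = 1 - sum_j p(j)^2 for the labels in one partition element."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(j) / n) ** 2 for j in set(labels))


def impurity(xs, ys, t):
    """I(t) = gamma_1 + gamma_2 for A1 = (-inf, t] and A2 = (t, inf)."""
    left = [y for x, y in zip(xs, ys) if x <= t]
    right = [y for x, y in zip(xs, ys) if x > t]
    return gini(left) + gini(right)


def best_split(xs, ys):
    """Search midpoints between consecutive distinct x values."""
    pts = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    return min(candidates, key=lambda t: impurity(xs, ys, t))


# Hypothetical toy data: the labels switch from 0 to 1 between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
t_best = best_split(xs, ys)
```

This follows the unweighted sum used in the definition above; many tree implementations instead weight each $\gamma_s$ by the fraction of observations falling in $A_s$.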

When there are several covariates, we choose whichever covariate and split
that leads to the lowest impurity. This process is continued until some stopping
criterion is met. For example, we might stop when every partition element has
fewer than $n_0$ data points, where $n_0$ is some fixed number. The bottom nodes
of the tree are called the leaves. Each leaf is assigned a 0 or 1 depending on
whether there are more data points with Y = 0 or Y = 1 in that partition
element.

This procedure is easily generalized to the case where $Y \in \{1, \ldots, K\}$. We
simply define the impurity by

$$\gamma_s = 1 - \sum_{j=1}^{K} \hat p_s(j)^2 \qquad (22.31)$$

where $\hat p_s(j)$ is the proportion of observations in the partition element for which
Y = j.
FIGURE 22.4. A classification tree for the heart disease data using two covariates.

22.13 Example. A classification tree for the heart disease data yields a mis- 
classification rate of .21. If we build a tree using only tobacco and age, the 
misclassification rate is then .29. The tree is shown in Figure 22.4. ■ 

Our description of how to build trees is incomplete. If we keep splitting 
until there are few cases in each leaf of the tree, we are likely to overfit the 
data. We should choose the complexity of the tree in such a way that the 
estimated true error rate is low. In the next section, we discuss estimation of 
the error rate. 



22.8 Assessing Error Rates and Choosing a Good 
Classifier 

How do we choose a good classifier? We would like to have a classifier h with
a low true error rate L(h). Usually, we can't use the training error rate $\hat L_n(h)$
as an estimate of the true error rate because it is biased downward.

22.14 Example. Consider the heart disease data again. Suppose we fit a se- 
quence of logistic regression models. In the first model we include one co- 
variate. In the second model we include two covariates, and so on. The ninth 
model includes all the covariates. We can go even further. Let’s also fit a tenth 
model that includes all nine covariates plus the first covariate squared. Then 
we fit an eleventh model that includes all nine covariates plus the first covariate
squared and the second covariate squared. Continuing this way we will get
a sequence of 18 classifiers of increasing complexity. The solid line in Figure
22.5 shows the observed classification error which steadily decreases as we
make the model more complex. If we keep going, we can make a model with
zero observed classification error. The dashed line shows the 10-fold
cross-validation estimate of the error rate (to be explained shortly) which is a
better estimate of the true error rate than the observed classification error.
The estimated error decreases for a while then increases. This is essentially
the bias-variance tradeoff phenomenon we have seen in Chapter 20. ■



FIGURE 22.5. The solid line is the observed error rate and dashed line is the
cross-validation estimate of true error rate.



There are many ways to estimate the error rate. We’ll consider two: cross- 
validation and probability inequalities. 

Cross-Validation. The basic idea of cross-validation, which we have al- 
ready encountered in curve estimation, is to leave out some of the data when 
fitting a model. The simplest version of cross-validation involves randomly 
splitting the data into two pieces: the training set T and the validation 
set V. Often, about 10 per cent of the data might be set aside as the validation 
set. The classifier $\hat h$ is constructed from the training set. We then estimate
the error by

$$\hat L(\hat h) = \frac{1}{m} \sum_{X_i \in \mathcal{V}} I(\hat h(X_i) \ne Y_i) \qquad (22.32)$$

where m is the size of the validation set. See Figure 22.6.

FIGURE 22.6. Cross-validation. The data are divided into two groups: the training
data $\mathcal{T}$ and the validation data $\mathcal{V}$. The training data are used to produce an
estimated classifier $\hat h$. Then, $\hat h$ is applied to the validation data to obtain an
estimate $\hat L$ of the error rate of $\hat h$.

Another approach to cross-validation is K-fold cross-validation which is 
obtained from the following algorithm. 



K-fold cross-validation.

1. Randomly divide the data into K chunks of approximately equal size.
A common choice is K = 10.

2. For k = 1 to K, do the following:

(a) Delete chunk k from the data.

(b) Compute the classifier $\hat h_{(k)}$ from the rest of the data.

(c) Use $\hat h_{(k)}$ to predict the data in chunk k. Let $\hat L_{(k)}$ denote
the observed error rate.

3. Let

$$\hat L(h) = \frac{1}{K} \sum_{k=1}^{K} \hat L_{(k)}. \qquad (22.33)$$
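The K-fold algorithm above can be sketched as a short function that works with any fitting procedure. The interface is an assumption made for this illustration: `fit(train_xs, train_ys)` is a hypothetical callable returning a classifier, and the majority-vote baseline and the toy data exist only to exercise the code. The sketch also assumes there are at least K observations.

```python
import random


def k_fold_error(xs, ys, fit, K=10, seed=0):
    """Average held-out error over K chunks (steps 1-3 of the algorithm).

    `fit(train_xs, train_ys)` must return a classifier h with h(x) -> label.
    """
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)               # step 1: random division
    chunks = [idx[k::K] for k in range(K)]         # roughly equal sizes
    errors = []
    for k in range(K):                             # step 2
        held_out = set(chunks[k])
        train = [i for i in idx if i not in held_out]
        h = fit([xs[i] for i in train], [ys[i] for i in train])
        wrong = sum(1 for i in chunks[k] if h(xs[i]) != ys[i])
        errors.append(wrong / len(chunks[k]))
    return sum(errors) / K                         # step 3


def fit_majority(train_xs, train_ys):
    """Hypothetical baseline: always predict the most common training label."""
    label = max(set(train_ys), key=train_ys.count)
    return lambda x: label


# Hypothetical toy data: 30 zeros and 10 ones.
xs = list(range(40))
ys = [0] * 30 + [1] * 10
cv_error = k_fold_error(xs, ys, fit_majority, K=10)
```

Because the baseline always predicts 0 here, each chunk's error is the fraction of ones it contains, and the chunks have equal size, so the average equals the overall fraction of ones regardless of the shuffle.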



22.15 Example. We applied 10-fold cross-validation to the heart disease data. 
The minimum cross-validation error as a function of the number of leaves 
occurred at six. Figure 22.7 shows the tree with six leaves. ■ 
FIGURE 22.7. Smaller classification tree with size chosen by cross-validation.

Probability Inequalities. Another approach to estimating the error rate
is to find a confidence interval for the true error rate, centered at the training
error $\hat L_n(\hat h)$, using probability inequalities. This
method is useful in the context of empirical risk minimization.

Let $\mathcal{H}$ be a set of classifiers, for example, all linear classifiers. Empirical risk
minimization means choosing the classifier $\hat h \in \mathcal{H}$ to minimize the training
error $\hat L_n(h)$, also called the empirical risk. Thus,

$$\hat h = \mathrm{argmin}_{h \in \mathcal{H}}\, \hat L_n(h) = \mathrm{argmin}_{h \in \mathcal{H}} \left( \frac{1}{n} \sum_i I(h(X_i) \ne Y_i) \right). \qquad (22.34)$$

Typically, $\hat L_n(\hat h)$ underestimates the true error rate $L(\hat h)$ because $\hat h$ was chosen
to make $\hat L_n(\hat h)$ small. Our goal is to assess how much underestimation is taking
place. Our main tool for this analysis is Hoeffding's inequality (Theorem
4.5). Recall that if $X_1, \ldots, X_n \sim \mathrm{Bernoulli}(p)$, then, for any $\epsilon > 0$,

$$P(|\hat p - p| > \epsilon) \le 2 e^{-2n\epsilon^2} \qquad (22.35)$$

where $\hat p = n^{-1} \sum_{i=1}^{n} X_i$.







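First, suppose that $\mathcal{H} = \{h_1, \ldots, h_m\}$ consists of finitely many classifiers.
For any fixed h, $\hat L_n(h)$ converges almost surely to L(h) by the law of large
numbers. We will now establish a stronger result.

22.16 Theorem (Uniform Convergence). Assume $\mathcal{H}$ is finite and has m elements.
Then,

$$P\left( \max_{h \in \mathcal{H}} |\hat L_n(h) - L(h)| > \epsilon \right) \le 2m e^{-2n\epsilon^2}.$$

Proof. We will use Hoeffding's inequality and we will also use the fact
that if $A_1, \ldots, A_m$ is a set of events then $P\left(\bigcup_{i=1}^{m} A_i\right) \le \sum_{i=1}^{m} P(A_i)$. Now,

$$P\left( \max_{h \in \mathcal{H}} |\hat L_n(h) - L(h)| > \epsilon \right) = P\left( \bigcup_{h \in \mathcal{H}} \left\{ |\hat L_n(h) - L(h)| > \epsilon \right\} \right) \le \sum_{h \in \mathcal{H}} P\left( |\hat L_n(h) - L(h)| > \epsilon \right) \le \sum_{h \in \mathcal{H}} 2 e^{-2n\epsilon^2} = 2m e^{-2n\epsilon^2}. \;■$$

22.17 Theorem. Let

$$\epsilon = \sqrt{\frac{2}{n} \log\left( \frac{2m}{\alpha} \right)}.$$

Then $\hat L_n(\hat h) \pm \epsilon$ is a $1 - \alpha$ confidence interval for $L(\hat h)$.

Proof. This follows from the fact that

$$P\left( |\hat L_n(\hat h) - L(\hat h)| > \epsilon \right) \le P\left( \max_{h \in \mathcal{H}} |\hat L_n(h) - L(h)| > \epsilon \right) \le 2m e^{-2n\epsilon^2} \le \alpha. \;■$$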

When $\mathcal{H}$ is large the confidence interval for $L(\hat h)$ is large. The more functions
there are in $\mathcal{H}$ the more likely it is we have "overfit" which we compensate
for by having a larger confidence interval.
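To see how the size of the class enters the interval, the sketch below evaluates the half-width from Theorem 22.17 and the bound from Theorem 22.16. The function names and the values of n, m and $\alpha$ are hypothetical choices for illustration.

```python
import math


def finite_class_halfwidth(n, m, alpha):
    """epsilon from Theorem 22.17: sqrt((2/n) * log(2m / alpha))."""
    return math.sqrt((2.0 / n) * math.log(2.0 * m / alpha))


def uniform_bound(n, m, eps):
    """Right-hand side of Theorem 22.16: 2 m exp(-2 n eps^2)."""
    return 2.0 * m * math.exp(-2.0 * n * eps * eps)


# Hypothetical numbers: n = 1000 observations, alpha = 0.05.
eps_small = finite_class_halfwidth(1000, 50, 0.05)     # m = 50 classifiers
eps_large = finite_class_halfwidth(1000, 5000, 0.05)   # m = 5000 classifiers
bound = uniform_bound(1000, 50, eps_small)
```

The half-width grows only logarithmically in m, so even a hundredfold increase in the number of classifiers widens the interval modestly; the bound evaluated at this half-width does not exceed $\alpha$.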

In practice we usually use sets $\mathcal{H}$ that are infinite, such as the set of linear
classifiers. To extend our analysis to these cases we want to be able to say
something like

$$P\left( \sup_{h \in \mathcal{H}} |\hat L_n(h) - L(h)| > \epsilon \right) \le \text{something not too big.}$$

One way to develop such a generalization is by way of the Vapnik-Chervonenkis
or VC dimension.
Let $\mathcal{A}$ be a class of sets. Given a finite set $F = \{x_1, \ldots, x_n\}$ let

$$N_{\mathcal{A}}(F) = \#\left\{ F \cap A : A \in \mathcal{A} \right\} \qquad (22.36)$$

be the number of subsets of F "picked out" by $\mathcal{A}$. Here #(B) denotes the
number of elements of a set B. The shatter coefficient is defined by

$$s(\mathcal{A}, n) = \max_{F \in \mathcal{F}_n} N_{\mathcal{A}}(F) \qquad (22.37)$$

where $\mathcal{F}_n$ consists of all finite sets of size n. Now let $X_1, \ldots, X_n \sim P$ and let

$$P_n(A) = \frac{1}{n} \sum_i I(X_i \in A)$$

denote the empirical probability measure. The following remarkable theorem
bounds the distance between P and $P_n$.

22.18 Theorem (Vapnik and Chervonenkis (1971)). For any P, n and $\epsilon > 0$,

$$P\left\{ \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| > \epsilon \right\} \le 8\, s(\mathcal{A}, n)\, e^{-n\epsilon^2/32}. \qquad (22.38)$$

The proof, though very elegant, is long and we omit it. If $\mathcal{H}$ is a set of
classifiers, define $\mathcal{A}$ to be the class of sets of the form $\{x : h(x) = 1\}$. We
then define $s(\mathcal{H}, n) = s(\mathcal{A}, n)$.

22.19 Theorem.

$$P\left\{ \sup_{h \in \mathcal{H}} |\hat L_n(h) - L(h)| > \epsilon \right\} \le 8\, s(\mathcal{H}, n)\, e^{-n\epsilon^2/32}.$$

A $1 - \alpha$ confidence interval for $L(\hat h)$ is $\hat L_n(\hat h) \pm \epsilon_n$ where

$$\epsilon_n^2 = \frac{32}{n} \log\left( \frac{8\, s(\mathcal{H}, n)}{\alpha} \right).$$

These theorems are only useful if the shatter coefficients do not grow too
quickly with n. This is where VC dimension enters.

22.20 Definition. The VC (Vapnik-Chervonenkis) dimension of a class of
sets $\mathcal{A}$ is defined as follows. If $s(\mathcal{A}, n) = 2^n$ for all n, set $VC(\mathcal{A}) = \infty$.
Otherwise, define $VC(\mathcal{A})$ to be the largest k for which $s(\mathcal{A}, k) = 2^k$.

Thus, the VC-dimension is the size of the largest finite set F that can be
shattered by $\mathcal{A}$, meaning that $\mathcal{A}$ picks out each subset of F. If $\mathcal{H}$ is a set of
classifiers we define $VC(\mathcal{H}) = VC(\mathcal{A})$ where $\mathcal{A}$ is the class of sets of the form
$\{x : h(x) = 1\}$ as h varies in $\mathcal{H}$. The following theorem shows that if $\mathcal{A}$ has
finite VC-dimension, then the shatter coefficients grow as a polynomial in n.









22.21 Theorem. If $\mathcal{A}$ has finite VC-dimension v, then

$$s(\mathcal{A}, n) \le n^v + 1.$$

22.22 Example. Let $\mathcal{A} = \{(-\infty, a];\ a \in \mathbb{R}\}$. Then $\mathcal{A}$ shatters every 1-point
set {x} but it shatters no set of the form {x, y}. Therefore, $VC(\mathcal{A}) = 1$. ■

22.23 Example. Let $\mathcal{A}$ be the set of closed intervals on the real line. Then
$\mathcal{A}$ shatters S = {x, y} but it cannot shatter sets with 3 points. Consider
S = {x, y, z} where x < y < z. One cannot find an interval A such that
$A \cap S = \{x, z\}$. So, $VC(\mathcal{A}) = 2$. ■
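The interval example can be checked by brute force. The sketch below enumerates the subsets of a finite set of distinct real numbers that closed intervals pick out; it relies on the fact that a closed interval always meets a sorted point set in a run of consecutive points (possibly empty), and that every such run is picked out by the interval from its first to its last point. The function names are hypothetical.

```python
def picked_out_by_intervals(points):
    """All subsets of `points` of the form F ∩ [a, b], including the empty set.

    Assumes the points are distinct real numbers.
    """
    pts = sorted(points)
    subsets = {frozenset()}                 # an interval missing every point
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            subsets.add(frozenset(pts[i:j + 1]))   # a = pts[i], b = pts[j]
    return subsets


def is_shattered(points):
    """True if closed intervals pick out all 2^n subsets of the n points."""
    return len(picked_out_by_intervals(points)) == 2 ** len(points)


two_points = is_shattered([0.0, 1.0])
three_points = is_shattered([0.0, 1.0, 2.0])
count_three = len(picked_out_by_intervals([0.0, 1.0, 2.0]))
```

For three points only 7 of the 8 subsets are picked out; the missing one is the pair of outer points, exactly as argued in the example.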

22.24 Example. Let $\mathcal{A}$ be all linear half-spaces on the plane. Any 3-point
set (not all on a line) can be shattered. No 4-point set can be shattered.
Consider, for example, 4 points forming a diamond. Let T be the leftmost and
rightmost points. This can't be picked out. Other configurations can also be
seen to be unshatterable. So $VC(\mathcal{A}) = 3$. In general, halfspaces in $\mathbb{R}^d$ have
VC dimension d + 1. ■

22.25 Example. Let $\mathcal{A}$ be all rectangles on the plane with sides parallel to
the axes. Some 4-point sets, such as 4 points forming a diamond, can be shattered.
Let S be a 5-point set. There is one point that is not leftmost, rightmost,
uppermost, or lowermost. Let T be all points in S except this point. Then T
can't be picked out. So $VC(\mathcal{A}) = 4$. ■

22.26 Theorem. Let x have dimension d and let $\mathcal{H}$ be the set of linear
classifiers. The VC-dimension of $\mathcal{H}$ is d + 1. Hence, a $1 - \alpha$ confidence interval for
the true error rate is $\hat L_n(\hat h) \pm \epsilon_n$ where

$$\epsilon_n^2 = \frac{32}{n} \log\left( \frac{8 (n^{d+1} + 1)}{\alpha} \right).$$
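It is instructive to evaluate the half-width in Theorem 22.26 for realistic sample sizes. The sketch below does this for d = 2 covariates; the function name is hypothetical, n = 462 is the size of the heart disease data, and n = 1,000,000 is an arbitrary large value chosen for comparison.

```python
import math


def vc_halfwidth(n, d, alpha):
    """epsilon_n with epsilon_n^2 = (32 / n) * log(8 * (n^(d+1) + 1) / alpha)."""
    return math.sqrt((32.0 / n) * math.log(8.0 * (n ** (d + 1) + 1) / alpha))


eps_heart = vc_halfwidth(462, 2, 0.05)        # sample size of the heart data
eps_large = vc_halfwidth(10 ** 6, 2, 0.05)    # a much larger sample
```

With n = 462 the half-width exceeds 1, so the interval carries no information about an error rate that lies in [0, 1]; the bound holds for every distribution and is correspondingly conservative. It only becomes informative for much larger n.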



22.9 Support Vector Machines 

In this section we consider a class of linear classifiers called support vector
machines. Throughout this section, we assume that Y is binary. It will be
convenient to label the outcomes as $-1$ and $+1$ instead of 0 and 1. A linear
classifier can then be written as

$$h(x) = \mathrm{sign}\big( H(x) \big)$$

where $x = (x_1, \ldots, x_d)$,

$$H(x) = a_0 + \sum_{i=1}^{d} a_i x_i$$

and

$$\mathrm{sign}(z) = \begin{cases} -1 & \text{if } z < 0 \\ 0 & \text{if } z = 0 \\ 1 & \text{if } z > 0. \end{cases}$$

First, suppose that the data are linearly separable, that is, there exists
a hyperplane that perfectly separates the two classes.

22.27 Lemma. The data can be separated by some hyperplane if and only if
there exists a hyperplane $H(x) = a_0 + \sum_{i=1}^{d} a_i x_i$ such that

$$Y_i H(x_i) \ge 1, \quad i = 1, \ldots, n. \qquad (22.39)$$

Proof. Suppose the data can be separated by a hyperplane $W(x) = b_0 + \sum_{i=1}^{d} b_i x_i$.
It follows that there exists some constant c such that $Y_i = 1$ implies
$W(X_i) \ge c$ and $Y_i = -1$ implies $W(X_i) \le -c$. Therefore, $Y_i W(X_i) \ge c$ for
all i. Let $H(x) = a_0 + \sum_{i=1}^{d} a_i x_i$ where $a_j = b_j / c$. Then $Y_i H(X_i) \ge 1$ for all
i. The reverse direction is straightforward. ■

In the separable case, there will be many separating hyperplanes. How
should we choose one? Intuitively, it seems reasonable to choose the hyperplane
"furthest" from the data in the sense that it separates the +1s and −1s
and maximizes the distance to the closest point. This hyperplane is called the
maximum margin hyperplane. The margin is the distance from the
hyperplane to the nearest point. Points on the boundary of the margin are
called support vectors. See Figure 22.8.

22.28 Theorem. The hyperplane $\hat H(x) = \hat a_0 + \sum_{i=1}^{d} \hat a_i x_i$ that separates the
data and maximizes the margin is given by minimizing $(1/2) \sum_{j=1}^{d} a_j^2$ subject
to (22.39).
It turns out that this problem can be recast as a quadratic programming
problem. Let $\langle X_i, X_k \rangle = X_i^T X_k$ denote the inner product of $X_i$ and $X_k$.

22.29 Theorem. Let $\hat H(x) = \hat a_0 + \sum_{i=1}^{d} \hat a_i x_i$ denote the optimal (largest
margin) hyperplane. Then, for j = 1, ..., d,

$$\hat a_j = \sum_{i=1}^{n} \hat\alpha_i Y_i X_j(i)$$

where $X_j(i)$ is the value of the covariate $X_j$ for the i-th data point, and
$\hat\alpha = (\hat\alpha_1, \ldots, \hat\alpha_n)$ is the vector that maximizes

$$\sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{k=1}^{n} \alpha_i \alpha_k Y_i Y_k \langle X_i, X_k \rangle \qquad (22.40)$$

subject to

$$\alpha_i \ge 0$$

and

$$0 = \sum_i \alpha_i Y_i.$$

The points $X_i$ for which $\hat\alpha_i \ne 0$ are called support vectors. $\hat a_0$ can be found
by solving

$$\hat\alpha_i \left( Y_i (X_i^T \hat a + \hat a_0) - 1 \right) = 0$$

for any support point $X_i$. $\hat H$ may be written as

$$\hat H(x) = \hat a_0 + \sum_{i=1}^{n} \hat\alpha_i Y_i \langle x, X_i \rangle.$$

FIGURE 22.8. The hyperplane $\hat H(x)$ has the largest margin of all hyperplanes that
separate the two classes.

There are many software packages that will solve this problem quickly. If
there is no perfect linear classifier, then one allows overlap between the groups
by replacing the condition (22.39) with

$$Y_i H(x_i) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n. \qquad (22.41)$$

The variables $\xi_1, \ldots, \xi_n$ are called slack variables.

We now maximize (22.40) subject to

$$0 \le \alpha_i \le c, \quad i = 1, \ldots, n$$

and

$$\sum_{i=1}^{n} \alpha_i Y_i = 0.$$

The constant c is a tuning parameter that controls the amount of overlap.



22.10 Kernelization 

There is a trick called kernelization for improving a computationally simple
classifier h. The idea is to map the covariate X, which takes values in $\mathcal{X}$,
into a higher dimensional space $\mathcal{Z}$ and apply the classifier in the bigger space
$\mathcal{Z}$. This can yield a more flexible classifier while retaining computational
simplicity.

The standard example of this idea is illustrated in Figure 22.9. The covariate
is $x = (x_1, x_2)$. The $Y_i$'s can be separated into two groups using an ellipse. Define
a mapping $\phi$ by

$$z = (z_1, z_2, z_3) = \phi(x) = (x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2).$$

Thus, $\phi$ maps $\mathcal{X} = \mathbb{R}^2$ into $\mathcal{Z} = \mathbb{R}^3$. In the higher-dimensional space $\mathcal{Z}$, the
$Y_i$'s are separable by a linear decision boundary. In other words,

a linear classifier in a higher-dimensional space corresponds to a non-
linear classifier in the original space.

The point is that to get a richer set of classifiers we do not need to give up the
convenience of linear classifiers. We simply map the covariates to a higher-
dimensional space. This is akin to making linear regression more flexible by
using polynomials.

FIGURE 22.9. Kernelization. Mapping the covariates into a higher-dimensional
space can make a complicated decision boundary into a simpler decision boundary.

There is a potential drawback. If we significantly expand the dimension
of the problem, we might increase the computational burden. For example,
if x has dimension d = 256 and we wanted to use all fourth-order terms,
then $z = \phi(x)$ has dimension 183,181,376. We are spared this computational
nightmare by the following two facts. First, many classifiers do not require
that we know the values of the individual points but, rather, just the inner
product between pairs of points. Second, notice in our example that the inner
product in $\mathcal{Z}$ can be written

$$\langle z, \tilde z \rangle = \langle \phi(x), \phi(\tilde x) \rangle = x_1^2 \tilde x_1^2 + 2 x_1 \tilde x_1 x_2 \tilde x_2 + x_2^2 \tilde x_2^2 = \left( \langle x, \tilde x \rangle \right)^2 \equiv K(x, \tilde x).$$

Thus, we can compute $\langle z, \tilde z \rangle$ without ever computing $Z_i = \phi(X_i)$.
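Both facts in this passage are easy to check numerically. The sketch below compares the inner product of the mapped points with the squared inner product of the original points, and counts the fourth-order monomials in 256 variables. The two test points and the function names are hypothetical.

```python
import math


def phi(x):
    """Feature map (x1^2, sqrt(2) x1 x2, x2^2) from R^2 to R^3."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def kernel(x, x_tilde):
    """K(x, x~) = <x, x~>^2, computed without forming phi."""
    return dot(x, x_tilde) ** 2


# Two hypothetical points in the plane.
x, x_tilde = (1.0, 2.0), (3.0, -1.0)
lhs = dot(phi(x), phi(x_tilde))     # inner product in the larger space
rhs = kernel(x, x_tilde)            # same number from the original space

# Number of degree-4 monomials in d = 256 variables: C(d + 3, 4).
n_terms = math.comb(256 + 3, 4)
```

The kernel evaluation costs one inner product in the original space, whatever the dimension of the feature space; this is what makes very high-dimensional maps usable in practice.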

To summarize, kernelization involves finding a mapping $\phi : \mathcal{X} \to \mathcal{Z}$ and a
classifier such that:

1. $\mathcal{Z}$ has higher dimension than $\mathcal{X}$ and so leads to a richer set of classifiers.

2. The classifier only requires computing inner products.

3. There is a function K, called a kernel, such that $\langle \phi(x), \phi(\tilde x) \rangle = K(x, \tilde x)$.

4. Everywhere the term $\langle x, \tilde x \rangle$ appears in the algorithm, replace it with
$K(x, \tilde x)$.
In fact, we never need to construct the mapping $\phi$ at all. We only need
to specify a kernel $K(x, \tilde x)$ that corresponds to $\langle \phi(x), \phi(\tilde x)\rangle$ for some $\phi$. This
raises an interesting question: given a function of two variables $K(x, y)$, does
there exist a function $\phi(x)$ such that $K(x, y) = \langle \phi(x), \phi(y)\rangle$? The answer is
provided by Mercer's theorem which says, roughly, that if $K$ is positive
definite, meaning that

$$
\int\!\!\int K(x, y) f(x) f(y)\, dx\, dy \ge 0
$$

for square integrable functions $f$, then such a $\phi$ exists. Examples of commonly
used kernels are:

polynomial   $K(x, \tilde x) = (\langle x, \tilde x\rangle + a)^r$
sigmoid      $K(x, \tilde x) = \tanh(a\langle x, \tilde x\rangle + b)$
Gaussian     $K(x, \tilde x) = \exp\left(-\|x - \tilde x\|^2/(2\sigma^2)\right)$
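A finite-sample analogue of the positive definiteness condition is that the Gram matrix $[K(x_i, x_j)]$ has no negative eigenvalues. The sketch below checks this for two of the kernels, assuming the common form $(\langle x, y\rangle + a)^r$ for the polynomial kernel; the parameter values and random points are arbitrary. The sigmoid kernel is left out because it is not positive definite for every choice of its parameters.

```python
import numpy as np

def polynomial_kernel(x, y, a=1.0, r=2):
    """Polynomial kernel (<x, y> + a)^r; a and r are illustrative defaults."""
    return (np.dot(x, y) + a) ** r

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    diff = np.asarray(x) - np.asarray(y)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))

def gram_matrix(kernel, X):
    """Matrix of kernel evaluations K(x_i, x_j) over all pairs of rows of X."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))  # eight arbitrary points in R^3

# Smallest eigenvalue of each Gram matrix; it should not be meaningfully negative.
min_eigs = {
    name: float(np.linalg.eigvalsh(gram_matrix(k, X)).min())
    for name, k in [("polynomial", polynomial_kernel), ("gaussian", gaussian_kernel)]
}
print(min_eigs)
```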



Let us now see how we can use this trick in LDA and in support vector 
machines. 

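Recall that the Fisher linear discriminant method replaces $X$ with $U = w^T X$
where $w$ is chosen to maximize the Rayleigh coefficient

$$
J(w) = \frac{w^T S_B w}{w^T S_W w},
$$

$$
S_B = (\overline{X}_0 - \overline{X}_1)(\overline{X}_0 - \overline{X}_1)^T
$$

and

$$
S_W = \left(\frac{(n_0 - 1)S_0}{(n_0 - 1) + (n_1 - 1)}\right) + \left(\frac{(n_1 - 1)S_1}{(n_0 - 1) + (n_1 - 1)}\right).
$$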

In the kernelized version, we replace $X_i$ with $Z_i = \phi(X_i)$ and we find $w$ to
maximize

$$
J(w) = \frac{w^T S_B w}{w^T S_W w} \tag{22.42}
$$

where

$$
S_B = (\overline{Z}_0 - \overline{Z}_1)(\overline{Z}_0 - \overline{Z}_1)^T
$$

and

$$
S_W = \left(\frac{(n_0 - 1)S_0}{(n_0 - 1) + (n_1 - 1)}\right) + \left(\frac{(n_1 - 1)S_1}{(n_0 - 1) + (n_1 - 1)}\right).
$$

Here, $S_j$ is the sample covariance of the $Z_i$'s for which $Y = j$. However, to
take advantage of kernelization, we need to re-express this in terms of inner
products and then replace the inner products with kernels.
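It can be shown that the maximizing vector $w$ is a linear combination of
the $Z_i$'s. Hence we can write

$$
w = \sum_{i=1}^n \alpha_i Z_i.
$$

Also,

$$
\overline{Z}_j = \frac{1}{n_j}\sum_{i=1}^n \phi(X_i) I(Y_i = j).
$$

Therefore,

$$
\begin{aligned}
w^T \overline{Z}_j &= \left(\sum_{i=1}^n \alpha_i Z_i\right)^T \left(\frac{1}{n_j}\sum_{s=1}^n \phi(X_s) I(Y_s = j)\right) \\
&= \frac{1}{n_j}\sum_{i=1}^n \sum_{s=1}^n \alpha_i I(Y_s = j) Z_i^T \phi(X_s) \\
&= \frac{1}{n_j}\sum_{i=1}^n \sum_{s=1}^n \alpha_i I(Y_s = j) \phi(X_i)^T \phi(X_s) \\
&= \frac{1}{n_j}\sum_{i=1}^n \sum_{s=1}^n \alpha_i I(Y_s = j) K(X_i, X_s) \\
&= \alpha^T M_j
\end{aligned}
$$

where $M_j$ is a vector whose $i$th component is

$$
M_j(i) = \frac{1}{n_j}\sum_{s=1}^n K(X_i, X_s) I(Y_s = j).
$$

It follows that

$$
w^T S_B w = \alpha^T M \alpha
$$

where $M = (M_0 - M_1)(M_0 - M_1)^T$. By similar calculations, we can write

$$
w^T S_W w = \alpha^T N \alpha
$$

where

$$
N = K_0\left(I - \frac{1}{n_0}\mathbf{1}\right)K_0^T + K_1\left(I - \frac{1}{n_1}\mathbf{1}\right)K_1^T,
$$

$I$ is the identity matrix, $\mathbf{1}$ is a matrix of all ones, and $K_j$ is the $n \times n_j$
matrix with entries $(K_j)_{rs} = K(x_r, x_s)$ with $x_s$ varying over the observations
in group $j$. Hence, we now find $\alpha$ to maximize

$$
J(\alpha) = \frac{\alpha^T M \alpha}{\alpha^T N \alpha}.
$$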
All the quantities are expressed in terms of the kernel. Formally, the solution
is $\alpha = N^{-1}(M_0 - M_1)$. However, $N$ might be non-invertible. In this case one
replaces $N$ by $N + bI$, for some constant $b$. Finally, the projection onto the
new subspace can be written as

$$
U = w^T \phi(x) = \sum_{i=1}^n \alpha_i K(x_i, x).
$$

The support vector machine can similarly be kernelized. We simply replace
$\langle X_i, X_j\rangle$ with $K(X_i, X_j)$. For example, instead of maximizing (22.40), we now
maximize

$$
\sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{k=1}^n \alpha_i \alpha_k Y_i Y_k K(X_i, X_k). \tag{22.43}
$$

The hyperplane can be written as $\hat H(x) = \hat a_0 + \sum_{i=1}^n \hat\alpha_i Y_i K(x, X_i)$.
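The kernelized Fisher discriminant can be sketched in a few lines. The code below assumes the usual form $N = \sum_j K_j (I - \mathbf{1}/n_j) K_j^T$ for the within-class matrix and uses the regularized solution $\alpha = (N + bI)^{-1}(M_0 - M_1)$; the toy data, the kernel $(1 + uv)^2$ and the value of $b$ are illustrative choices, not taken from the text.

```python
import numpy as np

def kernel_fisher(X, y, kernel, b=1e-3):
    """Return alpha and the projection U(x) = sum_i alpha_i K(x_i, x)."""
    n = len(X)
    K = np.array([[kernel(X[r], X[s]) for s in range(n)] for r in range(n)])
    M = []                       # M[j] holds the vector M_j
    N = np.zeros((n, n))
    for j in (0, 1):
        Kj = K[:, y == j]        # n x n_j block of kernel evaluations
        nj = Kj.shape[1]
        M.append(Kj.mean(axis=1))
        centering = np.eye(nj) - np.ones((nj, nj)) / nj
        N += Kj @ centering @ Kj.T
    # N is typically singular, so solve with N + b*I instead.
    alpha = np.linalg.solve(N + b * np.eye(n), M[0] - M[1])

    def project(x):
        return float(sum(a * kernel(xi, x) for a, xi in zip(alpha, X)))

    return alpha, project

# Toy one-dimensional data: class 1 sits between the two halves of class 0,
# so no linear rule in x separates the classes.
X = np.array([-2.0, -1.5, -0.5, -0.25, 0.0, 0.25, 0.5, 1.5, 2.0])
y = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0])
poly = lambda u, v: (1.0 + u * v) ** 2

alpha, project = kernel_fisher(X, y, poly)
U = np.array([project(x) for x in X])
print(np.round(U, 3))
```

On these data the projected values of the two classes fall on opposite sides of a single threshold, which is the point of working in the richer space.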



22.11 Other Classifiers 

There are many other classifiers and space precludes a full discussion of all of 
them. Let us briefly mention a few. 

The k-nearest-neighbors classifier is very simple. Given a point x, find 
the k data points closest to x. Classify x using the majority vote of these k 
neighbors. Ties can be broken randomly. The parameter k can be chosen by 
cross-validation. 
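A minimal sketch of the rule just described, assuming Euclidean distance (the text does not fix a metric); the function name and the toy data are illustrative.

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k, rng=None):
    """Classify x by majority vote among its k nearest training points."""
    if rng is None:
        rng = np.random.default_rng()
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances (assumed metric)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[nearest].tolist())
    top = max(votes.values())
    winners = sorted(label for label, count in votes.items() if count == top)
    if len(winners) == 1:
        return winners[0]
    return winners[rng.integers(len(winners))]    # break ties at random

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_classify(np.array([0.2, 0.1]), X_train, y_train, k=3))
print(knn_classify(np.array([5.5, 5.2]), X_train, y_train, k=3))
```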

Bagging is a method for reducing the variability of a classifier. It is most
helpful for highly nonlinear classifiers such as trees. We draw $B$ bootstrap
samples from the data. The $b$th bootstrap sample yields a classifier $h_b$. The
final classifier is

$$
\hat h(x) = \begin{cases} 1 & \text{if } \frac{1}{B}\sum_{b=1}^B h_b(x) \ge \frac{1}{2} \\ 0 & \text{otherwise.} \end{cases}
$$
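The voting rule can be sketched as follows. The base classifier here is a one-split rule on a single covariate, chosen only to keep the example short; `fit_stump`, the toy data and the choice $B = 50$ are illustrative assumptions.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a one-split classifier on a 1-D covariate by minimizing training error."""
    best = None
    for t in np.unique(x):
        for left_label in (0, 1):
            pred = np.where(x <= t, left_label, 1 - left_label)
            err = np.mean(pred != y)
            if best is None or err < best[0]:
                best = (err, t, left_label)
    _, t, left_label = best
    return lambda z: np.where(np.asarray(z) <= t, left_label, 1 - left_label)

def bagged_classifier(x, y, B=50, seed=0):
    """Majority vote over classifiers fit to B bootstrap samples."""
    rng = np.random.default_rng(seed)
    n = len(x)
    classifiers = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)          # bootstrap sample, drawn with replacement
        classifiers.append(fit_stump(x[idx], y[idx]))

    def h(z):
        votes = np.mean([h_b(z) for h_b in classifiers], axis=0)
        return (votes >= 0.5).astype(int)

    return h

x = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
h = bagged_classifier(x, y)
print(h(np.array([0.05, 0.95])))
```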

Boosting is a method for starting with a simple classifier and gradually
improving it by refitting the data giving higher weight to misclassified samples.
Suppose that $\mathcal{H}$ is a collection of classifiers, for example, trees with only
one split. Assume that $Y_i \in \{-1, 1\}$ and that each $h$ is such that $h(x) \in
\{-1, 1\}$. We usually give equal weight to all data points in the methods we
have discussed. But one can incorporate unequal weights quite easily in most
algorithms. For example, in constructing a tree, we could replace the impurity
measure with a weighted impurity measure. The original version of boosting,
called AdaBoost, is as follows.







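1. Set the weights $w_i = 1/n$, $i = 1, \ldots, n$.

2. For $j = 1, \ldots, J$, do the following steps:

(a) Construct a classifier $h_j$ from the data using the weights $w_1, \ldots, w_n$.

(b) Compute the weighted error estimate:

$$
\hat L_j = \frac{\sum_{i=1}^n w_i I(Y_i \ne h_j(X_i))}{\sum_{i=1}^n w_i}.
$$

(c) Let $\alpha_j = \log((1 - \hat L_j)/\hat L_j)$.

(d) Update the weights:

$$
w_i \longleftarrow w_i\, e^{\alpha_j I(Y_i \ne h_j(X_i))}.
$$

3. The final classifier is

$$
\hat h(x) = \text{sign}\left(\sum_{j=1}^J \alpha_j h_j(x)\right).
$$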
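The steps above can be sketched as follows, using one-split classifiers on a single covariate as the collection of base classifiers. The weight update is the usual AdaBoost rule $w_i \leftarrow w_i \exp(\alpha_j I(Y_i \ne h_j(X_i)))$; the clipping of the error estimate is a numerical guard added for this sketch, and the data are illustrative.

```python
import numpy as np

def fit_weighted_stump(x, y, w):
    """One-split classifier on a 1-D covariate minimizing the weighted error."""
    best = None
    # The extra threshold below the minimum allows a constant classifier.
    thresholds = np.concatenate(([x.min() - 1.0], np.unique(x)))
    for t in thresholds:
        for sign in (-1, 1):
            pred = np.where(x <= t, -sign, sign)
            err = np.sum(w * (pred != y))
            if best is None or err < best[0]:
                best = (err, t, sign)
    _, t, sign = best
    return lambda z: np.where(np.asarray(z) <= t, -sign, sign)

def adaboost(x, y, J=20):
    n = len(x)
    w = np.full(n, 1.0 / n)                        # step 1: equal weights
    alphas, classifiers = [], []
    for _ in range(J):                             # step 2
        h_j = fit_weighted_stump(x, y, w)          # (a) weighted fit
        miss = h_j(x) != y
        L = np.sum(w * miss) / np.sum(w)           # (b) weighted error
        L = np.clip(L, 1e-10, 1 - 1e-10)           # guard against log(0); not in the text
        alpha = np.log((1 - L) / L)                # (c)
        w = w * np.exp(alpha * miss)               # (d) upweight the mistakes
        alphas.append(alpha)
        classifiers.append(h_j)

    def h_final(z):                                # step 3: weighted vote
        total = sum(a * h_j(z) for a, h_j in zip(alphas, classifiers))
        return np.sign(total)

    return h_final

# Labels are +1 on an interval and -1 outside it, so no single split is perfect.
x = np.array([-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0])
y = np.array([-1, -1, 1, 1, 1, -1, -1])
h_final = adaboost(x, y)
print(h_final(x))
```

No single split classifies these data correctly, but the weighted vote over several splits does.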

There is now an enormous literature trying to explain and improve on
boosting. Whereas bagging is a variance reduction technique, boosting can
be thought of as a bias reduction technique. We start with a simple (and
hence highly biased) classifier, and we gradually reduce the bias. The
disadvantage of boosting is that the final classifier is quite complicated.

Neural Networks are regression models of the form*

$$
Y = \beta_0 + \sum_{j=1}^p \beta_j \sigma(\alpha_0 + \alpha^T X)
$$

where $\sigma$ is a smooth function, often taken to be $\sigma(v) = e^v/(1 + e^v)$. This
is really nothing more than a nonlinear regression model. Neural nets were
fashionable for some time but they pose great computational difficulties. In
particular, one often encounters multiple minima when trying to find the least
squares estimates of the parameters. Also, the number of terms $p$ is essentially
a smoothing parameter and there is the usual problem of trying to choose $p$
to find a good balance between bias and variance.

*This is the simplest version of a neural net. There are more complex versions of the model.
22.12 Bibliographic Remarks

The literature on classification is vast and is growing quickly. An excellent
reference is Hastie et al. (2001). For more on the theory, see Devroye et al.
(1996) and Vapnik (1998). Two recent books on kernels are Schölkopf and
Smola (2002) and Herbrich (2002).



22.13 Exercises 

1. Prove Theorem 22.5. 

2. Prove Theorem 22.7. 

3. Download the spam data from:

http://www-stat.stanford.edu/~tibs/ElemStatLearn/index.html

The data file can also be found on the course web page. The data
contain 57 covariates relating to email messages. Each email message was
classified as spam (Y=1) or not spam (Y=0). The outcome Y is the last
column in the file. The goal is to predict whether an email is spam or
not.

(a) Construct classification rules using (i) LDA, (ii) QDA, (iii) logistic
regression, and (iv) a classification tree. For each, report the observed
misclassification error rate and construct a 2-by-2 table of the form





           h(x) = 0    h(x) = 1
Y = 0         ??          ??
Y = 1         ??          ??



(b) Use 5-fold cross-validation to estimate the prediction accuracy of 
LDA and logistic regression. 

(c) Sometimes it helps to reduce the number of covariates. One strategy
is to compare $X_i$ for the spam and email group. For each of the 57
covariates, test whether the mean of the covariate is the same or different
between the two groups. Keep the 10 covariates with the smallest
p-values. Try LDA and logistic regression using only these 10 variables.
4. Let $\mathcal{A}$ be the set of two-dimensional spheres. That is, $A \in \mathcal{A}$ if
$A = \{(x, y) : (x - a)^2 + (y - b)^2 \le c^2\}$ for some $a, b, c$. Find the VC-dimension
of $\mathcal{A}$.

5. Classify the spam data using support vector machines. Free software for 
the support vector machine is at http://svmlight.joachims.org/ 

6. Use VC theory to get a confidence interval on the true error rate of the
LDA classifier for the iris data (from the book web site).

7. Suppose that $X_i \in \mathbb{R}$ and that $Y_i = 1$ whenever $|X_i| \le 1$ and $Y_i = 0$
whenever $|X_i| > 1$. Show that no linear classifier can perfectly classify
these data. Show that the kernelized data $Z_i = (X_i, X_i^2)$ can be linearly
separated.

8. Repeat question 5 using the kernel $K(x, \tilde x) = (1 + x^T \tilde x)^p$. Choose $p$ by
cross-validation.

9. Apply the $k$ nearest neighbors classifier to the "iris data." Choose $k$ by
cross-validation.



10. (Curse of Dimensionality.) Suppose that $X$ has a uniform distribution
on the $d$-dimensional cube $[-1/2, 1/2]^d$. Let $R$ be the distance from the
origin to the closest neighbor. Show that the median of $R$ is

$$
\left(\frac{1 - \left(\frac{1}{2}\right)^{1/n}}{v_d(1)}\right)^{1/d}
$$

where

$$
v_d(r) = r^d \frac{\pi^{d/2}}{\Gamma((d/2) + 1)}
$$

is the volume of a sphere of radius $r$. For what dimension $d$ does the
median of $R$ exceed the edge of the cube when $n = 100$, $n = 1{,}000$,
$n = 10{,}000$? (Hastie et al. (2001), p. 22-27.)

11. Fit a tree to the data in question 3. Now apply bagging and report your 
results. 

12. Fit a tree that uses only one split on one variable to the data in question 
3. Now apply boosting. 
13. Let $r(x) = \mathbb{P}(Y = 1 \mid X = x)$ and let $\hat r(x)$ be an estimate of $r(x)$. Consider
the classifier

$$
h(x) = \begin{cases} 1 & \text{if } \hat r(x) \ge 1/2 \\ 0 & \text{otherwise.} \end{cases}
$$

Assume that $\hat r(x) \approx N(\bar r(x), \sigma^2(x))$ for some functions $\bar r(x)$ and $\sigma^2(x)$.
Show that, for fixed $x$,

$$
\mathbb{P}(Y \ne h(x)) \approx \mathbb{P}(Y \ne h^*(x))
+ |2r(x) - 1| \times \left[1 - \Phi\left(\frac{\text{sign}\big(r(x) - (1/2)\big)\big(\bar r(x) - (1/2)\big)}{\sigma(x)}\right)\right]
$$

where $\Phi$ is the standard Normal CDF and $h^*$ is the Bayes rule. Regard
$\text{sign}\big(r(x) - (1/2)\big)\big(\bar r(x) - (1/2)\big)$ as a type of bias term. Explain the
implications for the bias-variance tradeoff in classification (Friedman
(1997)).

Hint: first show that

$$
\mathbb{P}(Y \ne h(x)) = |2r(x) - 1|\,\mathbb{P}(h(x) \ne h^*(x)) + \mathbb{P}(Y \ne h^*(x)).
$$
23.1 Introduction

Most of this book has focused on IID sequences of random variables. Now we 
consider sequences of dependent random variables. For example, daily tem- 
peratures will form a sequence of time-ordered random variables and clearly 
the temperature on one day is not independent of the temperature on the 
previous day. 

A stochastic process $\{X_t : t \in T\}$ is a collection of random variables.
We shall sometimes write $X(t)$ instead of $X_t$. The variables $X_t$ take values in
some set $\mathcal{X}$ called the state space. The set $T$ is called the index set and
for our purposes can be thought of as time. The index set can be discrete
$T = \{0, 1, 2, \ldots\}$ or continuous $T = [0, \infty)$ depending on the application.

23.1 Example (IID observations). A sequence of IID random variables can be
written as $\{X_t : t \in T\}$ where $T = \{1, 2, 3, \ldots\}$. Thus, a sequence of IID
random variables is an example of a stochastic process. ■ 

23.2 Example (The Weather). Let $\mathcal{X} = \{\text{sunny}, \text{cloudy}\}$. A typical sequence
(depending on where you live) might be

sunny, sunny, cloudy, sunny, cloudy, cloudy, ...

This process has a discrete state space and a discrete index set. ■ 
23.3 Example (Stock Prices). Figure 23.1 shows the price of a fictitious stock
over time. The price is monitored continuously so the index set T is continuous. 
Price is discrete but for all practical purposes we can treat it as a continuous 
variable. ■ 



23.4 Example (Empirical Distribution Function). Let $X_1, \ldots, X_n \sim F$ where
$F$ is some CDF on $[0, 1]$. Let

$$
\hat F_n(t) = \frac{1}{n}\sum_{i=1}^n I(X_i \le t)
$$

be the empirical CDF. For any fixed value $t$, $\hat F_n(t)$ is a random variable. But
the whole empirical CDF

$$
\left\{\hat F_n(t) : t \in [0, 1]\right\}
$$

is a stochastic process with a continuous state space and a continuous index
set. ■

We end this section by recalling a basic fact. If $X_1, \ldots, X_n$ are random
variables, then we can write the joint density as

$$
\begin{aligned}
f(x_1, \ldots, x_n) &= f(x_1) f(x_2 \mid x_1) \cdots f(x_n \mid x_1, \ldots, x_{n-1}) \\
&= \prod_{i=1}^n f(x_i \mid \text{past}_i)
\end{aligned} \tag{23.1}
$$

where $\text{past}_i = (X_1, \ldots, X_{i-1})$.



23.2 Markov Chains 

A Markov chain is a stochastic process for which the distribution of $X_t$
depends only on $X_{t-1}$. In this section we assume that the state space is
discrete, either $\mathcal{X} = \{1, \ldots, N\}$ or $\mathcal{X} = \{1, 2, \ldots\}$ and that the index set is
$T = \{0, 1, 2, \ldots\}$. Typically, most authors write $X_n$ instead of $X_t$ when
discussing Markov chains and I will do so as well.

23.5 Definition. The process $\{X_n : n \in T\}$ is a Markov chain if

$$
\mathbb{P}(X_n = x \mid X_0, \ldots, X_{n-1}) = \mathbb{P}(X_n = x \mid X_{n-1}) \tag{23.2}
$$

for all $n$ and for all $x \in \mathcal{X}$.

For a Markov chain, equation (23.1) simplifies to

$$
f(x_1, \ldots, x_n) = f(x_1) f(x_2 \mid x_1) f(x_3 \mid x_2) \cdots f(x_n \mid x_{n-1}).
$$

A Markov chain can be represented by the following DAG: 



$X_0 \longrightarrow X_1 \longrightarrow X_2 \longrightarrow \cdots$



Each variable has a single parent, namely, the previous observation. 

The theory of Markov chains is very rich and complex. We have to get
through many definitions before we can do anything interesting. Our goal is
to answer the following questions: 

1. When does a Markov chain “settle down” into some sort of equilibrium? 

2. How do we estimate the parameters of a Markov chain? 

3. How can we construct Markov chains that converge to a given equilib- 
rium distribution and why would we want to do that? 

We will answer questions 1 and 2 in this chapter. We will answer question
3 in the next chapter. To understand question 1, look at the two chains in
Figure 23.2. The first chain oscillates all over the place and will continue to
do so forever. The second chain eventually settles into an equilibrium. If we
constructed a histogram of the first process, it would keep changing as we got
more and more observations. But a histogram from the second chain would
eventually converge to some fixed distribution.





FIGURE 23.2. Two Markov chains. The first chain does not settle down into an 
equilibrium. The second does. 



Transition Probabilities. The key quantities of a Markov chain are the
probabilities of jumping from one state into another state. A Markov chain is
homogeneous if $\mathbb{P}(X_{n+1} = j \mid X_n = i)$ does not change with time. Thus, for
a homogeneous Markov chain, $\mathbb{P}(X_{n+1} = j \mid X_n = i) = \mathbb{P}(X_1 = j \mid X_0 = i)$. We
shall only deal with homogeneous Markov chains.



23.6 Definition. We call

$$
p_{ij} \equiv \mathbb{P}(X_{n+1} = j \mid X_n = i) \tag{23.3}
$$

the transition probabilities. The matrix $P$ whose $(i, j)$ element is $p_{ij}$
is called the transition matrix.

We will only consider homogeneous chains. Notice that $P$ has two
properties: (i) $p_{ij} \ge 0$ and (ii) $\sum_j p_{ij} = 1$. Each row can be regarded as a probability
mass function.

23.7 Example (Random Walk With Absorbing Barriers). Let $\mathcal{X} = \{1, \ldots, N\}$.
Suppose you are standing at one of these points. Flip a coin with $\mathbb{P}(\text{Heads}) = p$
and $\mathbb{P}(\text{Tails}) = q = 1 - p$. If it is heads, take one step to the right. If it is
tails, take one step to the left. If you hit one of the endpoints, stay there. The
transition matrix is

$$
P = \begin{pmatrix}
1 & 0 & 0 & 0 & \cdots & 0 & 0 \\
q & 0 & p & 0 & \cdots & 0 & 0 \\
0 & q & 0 & p & \cdots & 0 & 0 \\
\vdots & & & & \ddots & & \vdots \\
0 & 0 & 0 & 0 & \cdots & 0 & p \\
0 & 0 & 0 & 0 & \cdots & 0 & 1
\end{pmatrix}
$$

23.8 Example. Suppose the state space is $\mathcal{X} = \{\text{sunny}, \text{cloudy}\}$. Then $X_1$,
$X_2$, ... represents the weather for a sequence of days. The weather today
clearly depends on yesterday’s weather. It might also depend on the weather 
two days ago but as a first approximation we might assume that the depen- 
dence is only one day back. In that case the weather is a Markov chain and a 
typical transition matrix might be 

            Sunny    Cloudy
Sunny        0.4      0.6
Cloudy       0.8      0.2

For example, if it is sunny today, there is a 60 per cent chance it will be cloudy 
tomorrow. ■ 

Let

$$
p_{ij}(n) = \mathbb{P}(X_{m+n} = j \mid X_m = i) \tag{23.4}
$$

be the probability of going from state $i$ to state $j$ in $n$ steps. Let $P_n$ be the
matrix whose $(i, j)$ element is $p_{ij}(n)$. These are called the n-step transition
probabilities.

23.9 Theorem (The Chapman-Kolmogorov equations). The n-step probabilities 
satisfy 

$$
p_{ij}(m + n) = \sum_k p_{ik}(m) p_{kj}(n). \tag{23.5}
$$

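Proof. Recall that, in general,

$$
\mathbb{P}(X = x, Y = y) = \mathbb{P}(X = x)\,\mathbb{P}(Y = y \mid X = x).
$$

This fact is true in the more general form

$$
\mathbb{P}(X = x, Y = y \mid Z = z) = \mathbb{P}(X = x \mid Z = z)\,\mathbb{P}(Y = y \mid X = x, Z = z).
$$

Also, recall the law of total probability:

$$
\mathbb{P}(X = x) = \sum_y \mathbb{P}(X = x, Y = y).
$$

Using these facts and the Markov property we have

$$
\begin{aligned}
p_{ij}(m + n) &= \mathbb{P}(X_{m+n} = j \mid X_0 = i) \\
&= \sum_k \mathbb{P}(X_{m+n} = j, X_m = k \mid X_0 = i) \\
&= \sum_k \mathbb{P}(X_{m+n} = j \mid X_m = k, X_0 = i)\,\mathbb{P}(X_m = k \mid X_0 = i) \\
&= \sum_k \mathbb{P}(X_{m+n} = j \mid X_m = k)\,\mathbb{P}(X_m = k \mid X_0 = i) \\
&= \sum_k p_{ik}(m) p_{kj}(n). \quad ■
\end{aligned}
$$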

Look closely at equation (23.5). This is nothing more than the equation for 
matrix multiplication. Hence we have shown that 

$$
P_{m+n} = P_m P_n. \tag{23.6}
$$

By definition, $P_1 = P$. Using the above theorem, $P_2 = P_{1+1} = P_1 P_1 =
PP = P^2$. Continuing this way, we see that

$$
P_n = P^n \equiv \underbrace{P \times P \times \cdots \times P}_{\text{multiply the matrix } n \text{ times}}. \tag{23.7}
$$
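Since the n-step matrix is just a matrix power, it can be computed directly. The sketch below uses the two-state weather chain of Example 23.8 and checks the Chapman-Kolmogorov relation for one pair of step counts; the helper name `n_step` is illustrative.

```python
import numpy as np

# Transition matrix of the weather chain (state 0 = sunny, state 1 = cloudy).
P = np.array([[0.4, 0.6],
              [0.8, 0.2]])

def n_step(P, n):
    """n-step transition matrix, computed as the n-th matrix power of P."""
    return np.linalg.matrix_power(P, n)

P2 = n_step(P, 2)
P5 = n_step(P, 5)
print(P2)                                          # two-day transition probabilities
print(np.allclose(P5, n_step(P, 2) @ n_step(P, 3)))  # Chapman-Kolmogorov with m=2, n=3
```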

Let $\mu_n = (\mu_n(1), \ldots, \mu_n(N))$ be a row vector where

$$
\mu_n(i) = \mathbb{P}(X_n = i) \tag{23.8}
$$

is the marginal probability that the chain is in state $i$ at time $n$. In particular,
$\mu_0$ is called the initial distribution. To simulate a Markov chain, all you
need to know is $\mu_0$ and $P$. The simulation would look like this:

Step 1: Draw $X_0 \sim \mu_0$. Thus, $\mathbb{P}(X_0 = i) = \mu_0(i)$.

Step 2: Denote the outcome of step 1 by $i$. Draw $X_1 \sim P$. In other words,
$\mathbb{P}(X_1 = j \mid X_0 = i) = p_{ij}$.

Step 3: Suppose the outcome of step 2 is $j$. Draw $X_2 \sim P$. In other words,
$\mathbb{P}(X_2 = k \mid X_1 = j) = p_{jk}$.

And so on. 
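The simulation steps translate directly into code. This is a minimal sketch with states labeled 0, 1, ... rather than 1, 2, ...; the function name and the example chain are illustrative.

```python
import numpy as np

def simulate_chain(mu0, P, n_steps, rng):
    """Simulate X_0, ..., X_{n_steps} from initial distribution mu0 and matrix P."""
    states = np.empty(n_steps + 1, dtype=int)
    states[0] = rng.choice(len(mu0), p=mu0)                      # Step 1: X_0 ~ mu0
    for t in range(1, n_steps + 1):
        # Steps 2, 3, ...: draw the next state from the row of the current state.
        states[t] = rng.choice(P.shape[0], p=P[states[t - 1]])
    return states

rng = np.random.default_rng(0)
P = np.array([[0.4, 0.6],
              [0.8, 0.2]])
mu0 = np.array([0.5, 0.5])
path = simulate_chain(mu0, P, n_steps=10, rng=rng)
print(path)
```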

It might be difficult to understand the meaning of $\mu_n$. Imagine simulating
the chain many times. Collect all the outcomes at time $n$ from all the chains.
This histogram would look approximately like $\mu_n$. A consequence of Theorem
23.9 is the following:







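23.10 Lemma. The marginal probabilities are given by

$$
\mu_n = \mu_0 P^n.
$$

Proof.

$$
\begin{aligned}
\mu_n(j) &= \mathbb{P}(X_n = j) \\
&= \sum_i \mathbb{P}(X_n = j \mid X_0 = i)\,\mathbb{P}(X_0 = i) \\
&= \sum_i \mu_0(i) p_{ij}(n) = (\mu_0 P^n)(j). \quad ■
\end{aligned}
$$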



Summary of Terminology 

1. Transition matrix: $P(i, j) = \mathbb{P}(X_{n+1} = j \mid X_n = i) = p_{ij}$.

2. n-step matrix: $P_n(i, j) = \mathbb{P}(X_{n+m} = j \mid X_m = i)$.

3. $P_n = P^n$.

4. Marginal: $\mu_n(i) = \mathbb{P}(X_n = i)$.

5. $\mu_n = \mu_0 P^n$.

States. The states of a Markov chain can be classified according to various 
properties. 

23.11 Definition. We say that $i$ reaches $j$ (or $j$ is accessible from $i$) if
$p_{ij}(n) > 0$ for some $n$, and we write $i \to j$. If $i \to j$ and $j \to i$ then we
write $i \leftrightarrow j$ and we say that $i$ and $j$ communicate.

23.12 Theorem. The communication relation satisfies the following
properties:

1. $i \leftrightarrow i$.

2. If $i \leftrightarrow j$ then $j \leftrightarrow i$.

3. If $i \leftrightarrow j$ and $j \leftrightarrow k$ then $i \leftrightarrow k$.

4. The set of states $\mathcal{X}$ can be written as a disjoint union of classes $\mathcal{X} =
\mathcal{X}_1 \cup \mathcal{X}_2 \cup \cdots$ where two states $i$ and $j$ communicate with each other if
and only if they are in the same class.
If all states communicate with each other, then the chain is called
ducible. A set of states is closed if, once you enter that set of states you 
never leave. A closed set consisting of a single state is called an absorbing 
state. 

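23.13 Example. Let $\mathcal{X} = \{1, 2, 3, 4\}$ and

$$
P = \begin{pmatrix}
\frac{1}{3} & \frac{2}{3} & 0 & 0 \\
\frac{2}{3} & \frac{1}{3} & 0 & 0 \\
\frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \\
0 & 0 & 0 & 1
\end{pmatrix}.
$$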

The classes are {1, 2}, {3} and {4}. State 4 is an absorbing state. ■ 

Suppose we start a chain in state $i$. Will the chain ever return to state $i$?
If so, that state is called persistent or recurrent.

23.14 Definition. State $i$ is recurrent or persistent if

$$
\mathbb{P}(X_n = i \text{ for some } n \ge 1 \mid X_0 = i) = 1.
$$

Otherwise, state $i$ is transient.

23.15 Theorem. A state $i$ is recurrent if and only if

$$
\sum_n p_{ii}(n) = \infty. \tag{23.9}
$$

A state $i$ is transient if and only if

$$
\sum_n p_{ii}(n) < \infty. \tag{23.10}
$$

Proof. Define

$$
I_n = \begin{cases} 1 & \text{if } X_n = i \\ 0 & \text{if } X_n \ne i. \end{cases}
$$

The number of times that the chain is in state $i$ is $Y = \sum_{n=0}^\infty I_n$. The mean
of $Y$, given that the chain starts in state $i$, is

$$
\mathbb{E}(Y \mid X_0 = i) = \sum_{n=0}^\infty \mathbb{E}(I_n \mid X_0 = i) = \sum_{n=0}^\infty \mathbb{P}(X_n = i \mid X_0 = i) = \sum_{n=0}^\infty p_{ii}(n).
$$

Define $a_i = \mathbb{P}(X_n = i \text{ for some } n \ge 1 \mid X_0 = i)$. If $i$ is recurrent, $a_i = 1$. Thus,
the chain will eventually return to $i$. Once it does return to $i$, we argue again
that since $a_i = 1$, the chain will return to state $i$ again. By repeating this
argument, we conclude that $\mathbb{E}(Y \mid X_0 = i) = \infty$. If $i$ is transient, then $a_i < 1$.
When the chain is in state $i$, there is a probability $1 - a_i > 0$ that it will never
return to state $i$. Thus, the probability that the chain is in state $i$ exactly $n$
times is $a_i^{n-1}(1 - a_i)$. This is a geometric distribution which has finite mean. ■



23.16 Theorem. Facts about recurrence.

1. If state $i$ is recurrent and $i \leftrightarrow j$, then $j$ is recurrent.

2. If state $i$ is transient and $i \leftrightarrow j$, then $j$ is transient.

3. A finite Markov chain must have at least one recurrent state.

4. The states of a finite, irreducible Markov chain are all recurrent.

23.17 Theorem (Decomposition Theorem). The state space $\mathcal{X}$ can be written
as the disjoint union

$$
\mathcal{X} = \mathcal{X}_T \cup \mathcal{X}_1 \cup \mathcal{X}_2 \cup \cdots
$$

where $\mathcal{X}_T$ are the transient states and each $\mathcal{X}_i$ is a closed, irreducible set of
recurrent states.

23.18 Example (Random Walk). Let $\mathcal{X} = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$ and
suppose that $p_{i,i+1} = p$, $p_{i,i-1} = q = 1 - p$. All states communicate, hence either
all the states are recurrent or all are transient. To see which, suppose we start
at $X_0 = 0$. Note that

$$
p_{00}(2n) = \binom{2n}{n} p^n q^n \tag{23.11}
$$

since the only way to get back to 0 is to have $n$ heads (steps to the right) and
$n$ tails (steps to the left). We can approximate this expression using Stirling's
formula which says that

$$
n! \sim n^n \sqrt{n}\, e^{-n} \sqrt{2\pi}.
$$

Inserting this approximation into (23.11) shows that

$$
p_{00}(2n) \sim \frac{(4pq)^n}{\sqrt{n\pi}}.
$$

It is easy to check that $\sum_n p_{00}(n) < \infty$ if and only if $\sum_n p_{00}(2n) < \infty$.
Moreover, $\sum_n p_{00}(2n) = \infty$ if and only if $p = q = 1/2$. By Theorem 23.15,
the chain is recurrent if $p = 1/2$, otherwise it is transient. ■
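The Stirling approximation and the convergence claim can be checked numerically. The sketch compares the exact return probability $\binom{2n}{n}p^nq^n$ with $(4pq)^n/\sqrt{n\pi}$ and looks at partial sums for $p = 1/2$ and for an asymmetric walk; the value $p = 0.3$ and the cutoffs are arbitrary illustrative choices.

```python
from math import comb, pi, sqrt

def p00(n, p):
    """Exact probability that the walk is back at 0 after 2n steps."""
    q = 1.0 - p
    return comb(2 * n, n) * (p * q) ** n

def p00_stirling(n, p):
    """Stirling approximation (4pq)^n / sqrt(n * pi)."""
    q = 1.0 - p
    return (4 * p * q) ** n / sqrt(n * pi)

# The ratio of exact to approximate values approaches 1 as n grows.
ratio = p00(200, 0.5) / p00_stirling(200, 0.5)

# Partial sums: they keep growing when p = 1/2 and level off when p != 1/2.
sum_symmetric_100 = sum(p00(n, 0.5) for n in range(1, 101))
sum_symmetric_200 = sum(p00(n, 0.5) for n in range(1, 201))
sum_biased = sum(p00(n, 0.3) for n in range(1, 201))
print(ratio, sum_symmetric_100, sum_symmetric_200, sum_biased)
```

For $p \ne 1/2$ the full series has the closed form $1/\sqrt{1 - 4pq} - 1$, which equals 1.5 when $p = 0.3$, so the biased partial sum should sit very close to that value.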
Convergence of Markov Chains. To discuss the convergence of chains,
we need a few more definitions. Suppose that $X_0 = i$. Define the recurrence
time

$$
T_{ij} = \min\{n > 0 : X_n = j\} \tag{23.12}
$$

assuming $X_n$ ever reaches state $j$, otherwise define $T_{ij} = \infty$. The mean
recurrence time of a recurrent state $i$ is

$$
m_i = \mathbb{E}(T_{ii}) = \sum_n n f_{ii}(n) \tag{23.13}
$$

where

$$
f_{ij}(n) = \mathbb{P}(X_1 \ne j, X_2 \ne j, \ldots, X_{n-1} \ne j, X_n = j \mid X_0 = i).
$$

A recurrent state is null if $m_i = \infty$, otherwise it is called non-null or
positive.

23.19 Lemma. If a state is null and recurrent, then $p_{ii}(n) \to 0$.

23.20 Lemma. In a finite state Markov chain, all recurrent states are positive. 

Consider a three-state chain with transition matrix

$$
\begin{pmatrix}
0 & 1 & 0 \\
0 & 0 & 1 \\
1 & 0 & 0
\end{pmatrix}.
$$

Suppose we start the chain in state 1. Then we will return to state 1 at times
3, 6, 9, .... This is an example of a periodic chain. Formally, the period of state $i$
is $d$ if $p_{ii}(n) = 0$ whenever $n$ is not divisible by $d$ and $d$ is the largest integer
with this property. Thus, $d = \gcd\{n : p_{ii}(n) > 0\}$ where gcd means "greatest
common divisor." State $i$ is periodic if $d(i) > 1$ and aperiodic if $d(i) = 1$.

23.21 Lemma. If state $i$ has period $d$ and $i \leftrightarrow j$ then $j$ has period $d$.



23.22 Definition. A state is ergodic if it is recurrent, non-null and 
aperiodic. A chain is ergodic if all its states are ergodic. 

Let $\pi = (\pi_i : i \in \mathcal{X})$ be a vector of non-negative numbers that sum to one.
Thus $\pi$ can be thought of as a probability mass function.

23.23 Definition. We say that $\pi$ is a stationary (or invariant)
distribution if $\pi = \pi P$.
Here is the intuition. Draw $X_0$ from distribution $\pi$ and suppose that $\pi$ is a
stationary distribution. Now draw $X_1$ according to the transition probability
of the chain. The distribution of $X_1$ is then $\mu_1 = \mu_0 P = \pi P = \pi$. The
distribution of $X_2$ is $\pi P^2 = (\pi P)P = \pi P = \pi$. Continuing this way, we see
that the distribution of $X_n$ is $\pi P^n = \pi$. In other words:

If at any time the chain has distribution $\pi$, then it will continue to
have distribution $\pi$ forever.
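For a small chain the stationary distribution can be found by solving $\pi = \pi P$ together with $\sum_i \pi_i = 1$. The sketch below does this by least squares for the weather chain of Example 23.8 and assumes the stationary distribution is unique; the function name is illustrative.

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi (P - I) = 0 with sum(pi) = 1; assumes a unique solution exists."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])   # n balance equations plus normalization
    b = np.concatenate([np.zeros(n), [1.0]])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

P = np.array([[0.4, 0.6],
              [0.8, 0.2]])
pi = stationary_distribution(P)
print(pi)          # stationary probabilities of sunny and cloudy
print(pi @ P)      # one more step leaves the distribution unchanged
```

For this chain the balance equation $0.6\,\pi_{\text{sunny}} = 0.8\,\pi_{\text{cloudy}}$ gives $\pi = (4/7, 3/7)$ by hand, which is what the solver returns.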



23.24 Definition. We say that a chain has limiting distribution if

$$
P^n \to \begin{pmatrix} \pi \\ \pi \\ \vdots \\ \pi \end{pmatrix}
$$

for some $\pi$, that is, $\pi_j = \lim_{n\to\infty} P^n_{ij}$ exists and is independent of $i$.



Here is the main theorem about convergence. The theorem says that an 
ergodic chain converges to its stationary distribution. Also, sample averages 
converge to their theoretical expectations under the stationary distribution. 



23.25 Theorem. An irreducible, ergodic Markov chain has a unique
stationary distribution $\pi$. The limiting distribution exists and is equal to
$\pi$. If $g$ is any bounded function, then, with probability 1,

$$
\lim_{N\to\infty} \frac{1}{N}\sum_{n=1}^N g(X_n) \to \mathbb{E}_\pi(g) \equiv \sum_j g(j)\pi_j. \tag{23.14}
$$



Finally, there is another definition that will be useful later. We say that tt 
satisfies detailed balance if 



$$
\pi_i p_{ij} = p_{ji}\pi_j. \tag{23.15}
$$

Detailed balance guarantees that tt is a stationary distribution. 

23.26 Theorem. If $\pi$ satisfies detailed balance, then $\pi$ is a stationary
distribution.

Proof. We need to show that $\pi P = \pi$. The $j$th element of $\pi P$ is
$\sum_i \pi_i p_{ij} = \sum_i \pi_j p_{ji} = \pi_j \sum_i p_{ji} = \pi_j$. ■

The importance of detailed balance will become clear when we discuss
Markov chain Monte Carlo methods in Chapter 24.

Warning! Just because a chain has a stationary distribution does not mean 
it converges. 

23.27 Example. Let

$$
P = \begin{pmatrix}
0 & 1 & 0 \\
0 & 0 & 1 \\
1 & 0 & 0
\end{pmatrix}.
$$

Let $\pi = (1/3, 1/3, 1/3)$. Then $\pi P = \pi$ so $\pi$ is a stationary distribution. If
the chain is started with the distribution $\pi$ it will stay in that distribution.
Imagine simulating many chains and checking the marginal distribution at
each time $n$. It will always be the uniform distribution $\pi$. But this chain does
not have a limit. It continues to cycle around forever. ■
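The warning is easy to see numerically: the uniform vector is stationary for the cyclic chain, yet the powers of $P$ cycle with period 3 instead of converging. A minimal sketch:

```python
import numpy as np

# Cyclic three-state chain: 1 -> 2 -> 3 -> 1.
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
pi = np.full(3, 1.0 / 3.0)

# powers[k] holds P^(k+1) for k = 0, ..., 5.
powers = [np.linalg.matrix_power(P, n) for n in range(1, 7)]

print(np.allclose(pi @ P, pi))                 # pi is stationary
print(np.array_equal(powers[2], np.eye(3)))    # P^3 is the identity
print(np.array_equal(powers[0], powers[3]))    # P^4 = P, so the powers cycle
```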

Examples of Markov Chains. 

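23.28 Example. Let $\mathcal{X} = \{1, 2, 3, 4, 5, 6\}$. Let

$$
P = \begin{pmatrix}
\frac{1}{2} & \frac{1}{2} & 0 & 0 & 0 & 0 \\
\frac{1}{4} & \frac{3}{4} & 0 & 0 & 0 & 0 \\
\frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & 0 & 0 \\
\frac{1}{4} & 0 & \frac{1}{4} & \frac{1}{4} & 0 & \frac{1}{4} \\
0 & 0 & 0 & 0 & \frac{1}{2} & \frac{1}{2} \\
0 & 0 & 0 & 0 & \frac{1}{2} & \frac{1}{2}
\end{pmatrix}
$$

Then $C_1 = \{1, 2\}$ and $C_2 = \{5, 6\}$ are irreducible closed sets. States 3 and
4 are transient because of the path $3 \to 4 \to 6$ and once you hit state 6
you cannot return to 3 or 4. Since $p_{ii}(1) > 0$, all the states are aperiodic. In
summary, 3 and 4 are transient while 1, 2, 5, and 6 are ergodic. ■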

23.29 Example (Hardy-Weinberg). Here is a famous example from genetics.
Suppose a gene can be type A or type a. There are three types of people (called
genotypes): AA, Aa, and aa. Let $(p, q, r)$ denote the fraction of people of each
genotype. We assume that everyone contributes one of their two copies of the
gene at random to their children. We also assume that mates are selected at
random. The latter is not realistic; however, it is often reasonable to assume
that you do not choose your mate based on whether they are AA, Aa, or
aa. (This would be false if the gene was for eye color and if people chose
mates based on eye color.) Imagine if we pooled everyone's genes together.
The proportion of A genes is $P = p + (q/2)$ and the proportion of a genes is
$Q = r + (q/2)$. A child is AA with probability $P^2$, Aa with probability $2PQ$,
and aa with probability $Q^2$. Thus, the fraction of A genes in this generation
is

$$
P^2 + PQ = \left(p + \frac{q}{2}\right)^2 + \left(p + \frac{q}{2}\right)\left(r + \frac{q}{2}\right).
$$

However, $r = 1 - p - q$. Substitute this in the above equation and you get
$P^2 + PQ = P$. A similar calculation shows that the fraction of "a" genes is
$Q$. We have shown that the proportion of type A and type a is $P$ and $Q$ and
this remains stable after the first generation. The proportion of people of type
AA, Aa, aa is thus $(P^2, 2PQ, Q^2)$ from the second generation and on. This is
called the Hardy-Weinberg law.

Assume everyone has exactly one child. Now consider a fixed person and
let $X_n$ be the genotype of their $n$th descendant. This is a Markov chain with
state space $\mathcal{X} = \{AA, Aa, aa\}$. Some basic calculations will show you that the
transition matrix is

$$
\begin{pmatrix}
P & Q & 0 \\
\frac{P}{2} & \frac{P+Q}{2} & \frac{Q}{2} \\
0 & P & Q
\end{pmatrix}.
$$

The stationary distribution is $\pi = (P^2, 2PQ, Q^2)$. ■
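The stationarity claim can be verified numerically for any starting genotype fractions. The sketch assumes the transition matrix has rows $(P, Q, 0)$, $(P/2, (P+Q)/2, Q/2)$ and $(0, P, Q)$ in the order AA, Aa, aa; the fractions used are arbitrary.

```python
import numpy as np

def hardy_weinberg(p, q, r):
    """Transition matrix and claimed stationary distribution for fractions (p, q, r)."""
    P_ = p + q / 2.0          # proportion of A genes
    Q_ = r + q / 2.0          # proportion of a genes
    T = np.array([[P_,       Q_,              0.0],
                  [P_ / 2.0, (P_ + Q_) / 2.0, Q_ / 2.0],
                  [0.0,      P_,              Q_]])
    pi = np.array([P_**2, 2.0 * P_ * Q_, Q_**2])
    return T, pi

T, pi = hardy_weinberg(0.5, 0.3, 0.2)   # illustrative genotype fractions
print(T.sum(axis=1))                    # each row is a probability mass function
print(pi, pi @ T)                       # pi is unchanged by one generation
```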



23.30 Example (Markov chain Monte Carlo). In Chapter 24 we will present a
simulation method called Markov chain Monte Carlo (MCMC). Here is a brief
description of the idea. Let $f(x)$ be a probability density on the real line and
suppose that $f(x) = c\,g(x)$ where $g(x)$ is a known function and $c > 0$ is
unknown. In principle, we can compute $c$ since $\int f(x)\,dx = 1$ implies that
$c = 1/\int g(x)\,dx$. However, it may not be feasible to perform this integral, nor
is it necessary to know $c$ in the following algorithm. Let $X_0$ be an arbitrary
starting value. Given $X_0, \ldots, X_i$, draw $X_{i+1}$ as follows. First, draw $W \sim
N(X_i, b^2)$ where $b > 0$ is some fixed constant. Let

$$
r = \min\left\{\frac{g(W)}{g(X_i)}, 1\right\}.
$$

Draw $U \sim \text{Uniform}(0, 1)$ and set

$$
X_{i+1} = \begin{cases} W & \text{if } U < r \\ X_i & \text{if } U \ge r. \end{cases}
$$

We will see in Chapter 24 that, under weak conditions, $X_0, X_1, \ldots$, is an
ergodic Markov chain with stationary distribution $f$. Hence, we can regard
the draws as a sample from $f$. ■
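The algorithm can be sketched in a few lines. The target here is $g(x) = e^{-x^2/2}$, an unnormalized standard Normal density picked only so the output is easy to check; the starting value, the constant $b$ and the number of draws are likewise illustrative.

```python
import numpy as np

def g(x):
    """Unnormalized target density; proportional to N(0, 1) (illustrative choice)."""
    return np.exp(-0.5 * x * x)

def random_walk_metropolis(g, x0, n_draws, b, rng):
    """Generate X_0, ..., X_{n_draws} using Normal proposals centered at the current value."""
    x = np.empty(n_draws + 1)
    x[0] = x0
    for i in range(n_draws):
        w = rng.normal(x[i], b)             # proposal W ~ N(X_i, b^2)
        r = min(g(w) / g(x[i]), 1.0)        # acceptance probability
        u = rng.uniform()                   # U ~ Uniform(0, 1)
        x[i + 1] = w if u < r else x[i]
    return x

rng = np.random.default_rng(0)
draws = random_walk_metropolis(g, x0=0.0, n_draws=20000, b=2.0, rng=rng)
print(draws.mean(), draws.var())
```

Note that the normalizing constant $c$ never appears: only the ratio $g(W)/g(X_i)$ is needed. The sample mean and variance should be close to 0 and 1, the moments of the normalized target.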
Inference for Markov Chains. Consider a chain with finite state space
$\mathcal{X} = \{1, 2, \ldots, N\}$. Suppose we observe $n$ observations $X_1, \ldots, X_n$ from this
chain. The unknown parameters of a Markov chain are the initial probabilities
$\mu_0 = (\mu_0(1), \mu_0(2), \ldots)$ and the elements of the transition matrix $P$. Each
row of $P$ is a multinomial distribution. So we are essentially estimating $N$
distributions (plus the initial probabilities). Let $n_{ij}$ be the observed number
of transitions from state $i$ to state $j$. The likelihood function is

$$
\mathcal{L}(\mu_0, P) = \mu_0(x_0)\prod_{r=1}^n p_{x_{r-1}, x_r} = \mu_0(x_0)\prod_{i=1}^N \prod_{j=1}^N p_{ij}^{n_{ij}}.
$$

There is only one observation on $\mu_0$ so we can't estimate that. Rather, we
focus on estimating $P$. The MLE is obtained by maximizing $\mathcal{L}(\mu_0, P)$ subject
to the constraint that the elements are non-negative and the rows sum to 1.
The solution is

$$
\hat p_{ij} = \frac{n_{ij}}{n_i}
$$

where $n_i = \sum_{j=1}^N n_{ij}$. Here we are assuming that $n_i > 0$. If not, then we set
$\hat p_{ij} = 0$ by convention.
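The estimator is a matter of counting transitions and normalizing each row. A minimal sketch, with states labeled 0, 1, ... rather than 1, ..., N and an illustrative observed sequence:

```python
import numpy as np

def transition_mle(x, n_states):
    """Return the transition counts n_ij and the MLE p_ij = n_ij / n_i."""
    counts = np.zeros((n_states, n_states))
    for i, j in zip(x[:-1], x[1:]):
        counts[i, j] += 1                     # one observed transition i -> j
    row_totals = counts.sum(axis=1, keepdims=True)
    # Rows with n_i = 0 are set to 0 by convention.
    P_hat = np.divide(counts, row_totals,
                      out=np.zeros_like(counts), where=row_totals > 0)
    return counts, P_hat

x = [0, 1, 0, 0, 1, 1, 0]                     # observed sequence of states
counts, P_hat = transition_mle(x, n_states=3)  # state 2 is never visited
print(counts)
print(P_hat)
```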



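23.31 Theorem (Consistency and Asymptotic Normality of the MLE). Assume that
the chain is ergodic. Let $\hat p_{ij}(n)$ denote the MLE after $n$ observations. Then
$\hat p_{ij}(n) \xrightarrow{P} p_{ij}$. Also,

$$
\left[\sqrt{N_i(n)}\,(\hat p_{ij} - p_{ij})\right] \rightsquigarrow N(0, \Sigma)
$$

where the left-hand side is a matrix, $N_i(n) = \sum_{r=1}^n I(X_r = i)$ and

$$
\Sigma_{ij,k\ell} = \begin{cases}
p_{ij}(1 - p_{ij}) & (i, j) = (k, \ell) \\
-p_{ij}\, p_{i\ell} & i = k,\ j \ne \ell \\
0 & \text{otherwise.}
\end{cases}
$$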



23.3 Poisson Processes 

The Poisson process arises when we count occurrences of events over time, for 
example, traffic accidents, radioactive decay, arrival of email messages, etc. 
As the name suggests, the Poisson process is intimately related to the Poisson 
distribution. Let’s first review the Poisson distribution. 

Recall that $X$ has a Poisson distribution with parameter $\lambda$, written $X \sim
\text{Poisson}(\lambda)$, if

$$
\mathbb{P}(X = x) \equiv p(x; \lambda) = \frac{e^{-\lambda}\lambda^x}{x!}, \quad x = 0, 1, 2, \ldots
$$
Also recall that $\mathbb{E}(X) = \lambda$ and $\mathbb{V}(X) = \lambda$. If $X \sim \text{Poisson}(\lambda)$, $Y \sim \text{Poisson}(\nu)$
and $X$ and $Y$ are independent, then $X + Y \sim \text{Poisson}(\lambda + \nu)$. Finally, if $N \sim \text{Poisson}(\lambda)$ and $Y \mid N =
n \sim \text{Binomial}(n, p)$, then the marginal distribution of $Y$ is $Y \sim \text{Poisson}(\lambda p)$.

Now we describe the Poisson process. Imagine that you are at your
computer. Each time a new email message arrives you record the time. Let $X_t$ be
the number of messages you have received up to and including time $t$. Then,
$\{X_t : t \in [0, \infty)\}$ is a stochastic process with state space $\mathcal{X} = \{0, 1, 2, \ldots\}$.
A process of this form is called a counting process. A Poisson process is
a counting process that satisfies certain conditions. In what follows, we will
sometimes write $X(t)$ instead of $X_t$. Also, we need the following notation.
Write $f(h) = o(h)$ if $f(h)/h \to 0$ as $h \to 0$. This means that $f(h)$ is smaller
than $h$ when $h$ is close to 0. For example, $h^2 = o(h)$.

23.32 Definition. A Poisson process is a stochastic process
$\{X_t : t \in [0, \infty)\}$ with state space $\mathcal{X} = \{0, 1, 2, \ldots\}$ such that

1. $X(0) = 0$.

2. For any $0 = t_0 < t_1 < t_2 < \cdots < t_n$, the increments

$$
X(t_1) - X(t_0),\ X(t_2) - X(t_1),\ \ldots,\ X(t_n) - X(t_{n-1})
$$

are independent.

3. There is a function $\lambda(t)$ such that

$$
\mathbb{P}(X(t + h) - X(t) = 1) = \lambda(t)h + o(h) \tag{23.16}
$$

$$
\mathbb{P}(X(t + h) - X(t) \ge 2) = o(h). \tag{23.17}
$$

We call $\lambda(t)$ the intensity function.

The last condition means that the probability of an event in $[t, t + h]$ is
approximately $h\lambda(t)$ while the probability of more than one event is small.

23.33 Theorem. If $X_t$ is a Poisson process with intensity function $\lambda(t)$, then

$$
X(s + t) - X(s) \sim \text{Poisson}(m(s + t) - m(s))
$$

where

$$
m(t) = \int_0^t \lambda(s)\, ds.
$$

In particular, $X(t) \sim \text{Poisson}(m(t))$. Hence, $\mathbb{E}(X(t)) = m(t)$ and $\mathbb{V}(X(t)) =
m(t)$.
23.34 Definition. A Poisson process with intensity function $\lambda(t) \equiv \lambda$ for
some $\lambda > 0$ is called a homogeneous Poisson process with rate $\lambda$. In
this case,

$$X(t) \sim \text{Poisson}(\lambda t).$$



Let $X(t)$ be a homogeneous Poisson process with rate $\lambda$. Let $W_n$ be the
time at which the $n$th event occurs and set $W_0 = 0$. The random variables
$W_0, W_1, \ldots$ are called waiting times. Let $S_n = W_{n+1} - W_n$. Then $S_0, S_1, \ldots$
are called sojourn times or interarrival times.

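23.35 Theorem. The sojourn times $S_0, S_1, \ldots$ are IID random variables. Their
distribution is exponential with mean $1/\lambda$, that is, they have density

$$f(s) = \lambda e^{-\lambda s}, \qquad s \geq 0.$$

The waiting time $W_n \sim \text{Gamma}(n, 1/\lambda)$, i.e., it has density

$$f(w) = \frac{1}{\Gamma(n)} \lambda^n w^{n-1} e^{-\lambda w}.$$

Hence, $\mathbb{E}(W_n) = n/\lambda$ and $\mathbb{V}(W_n) = n/\lambda^2$.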
Proof. First, we have

$$\mathbb{P}(S_1 > t) = \mathbb{P}(X(t) = 0) = e^{-\lambda t}$$

which shows that the CDF for $S_1$ is $1 - e^{-\lambda t}$. This shows the result for $S_1$. Now,

$$\begin{aligned}
\mathbb{P}(S_2 > t \mid S_1 = s) &= \mathbb{P}(\text{no events in } (s, s + t] \mid S_1 = s) \\
&= \mathbb{P}(\text{no events in } (s, s + t]) \quad \text{(increments are independent)} \\
&= e^{-\lambda t}.
\end{aligned}$$

Hence, $S_2$ has an exponential distribution and is independent of $S_1$. The result
follows by repeating the argument. The result for $W_n$ follows since a sum of
exponentials has a Gamma distribution. ■
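
The theorem suggests a simple way to simulate a homogeneous Poisson process: add up independent exponential interarrival times until the time horizon is passed. The following is a minimal sketch of that idea, not code from the text. The function name, the rate, the horizon, and the seed are all illustrative assumptions.

```python
import random

def simulate_poisson_process(rate, horizon, rng):
    """Event times of a homogeneous Poisson process on [0, horizon]."""
    times = []
    t = rng.expovariate(rate)        # first interarrival time, mean 1/rate
    while t <= horizon:
        times.append(t)
        t += rng.expovariate(rate)   # add an independent Exponential(rate) gap
    return times

rng = random.Random(0)               # fixed seed (an arbitrary choice)
rate, horizon, reps = 2.0, 5.0, 20000
counts = [len(simulate_poisson_process(rate, horizon, rng)) for _ in range(reps)]
mean_count = sum(counts) / reps
var_count = sum((c - mean_count) ** 2 for c in counts) / (reps - 1)
print(mean_count, var_count)         # both should be near rate * horizon
```

Because $X(t) \sim \text{Poisson}(\lambda t)$, the sample mean and the sample variance of the simulated counts should both be close to $\lambda t$, which gives a quick sanity check on the simulator.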

23.36 Example. Figure 23.3 shows requests to a WWW server in Calgary.¹
Assuming that this is a homogeneous Poisson process, $N \equiv X(T) \sim \text{Poisson}(\lambda T)$.
The likelihood is

$$\mathcal{L}(\lambda) \propto e^{-\lambda T}(\lambda T)^N$$

which is maximized at

$$\hat{\lambda} = \frac{N}{T} = 48.0077$$

in units per minute. Let's now test the assumption that the data follow a
homogeneous Poisson process using a goodness-of-fit test. We divide the interval
$[0, T]$ into 4 equal length intervals $I_1, I_2, I_3, I_4$. If the process is a homogeneous
Poisson process then, given the total number of events, the probability that an
event falls into any of these intervals must be equal. Let $p_i$ be the probability
of a point being in $I_i$. The null hypothesis is that $p_1 = p_2 = p_3 = p_4 = 1/4$.
We can test this hypothesis using either a likelihood ratio test or a $\chi^2$ test.
The latter is

$$\sum_{i=1}^{4} \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the number of observations in $I_i$ and $E_i = n/4$ is the expected
number under the null. This yields $\chi^2 = 252$ with a p-value near 0. This is
strong evidence against the null so we reject the hypothesis that the data are
from a homogeneous Poisson process. This is hardly surprising since we would
expect the intensity to vary as a function of time. ■

¹ See http://ita.ee.lbl.gov/html/contrib/Calgary-HTTP.html for more information.

FIGURE 23.3. Hits on a web server. Each vertical line represents one event.
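
The goodness-of-fit test in the example can be sketched in a few lines. The server log itself is not reproduced here, so the event times below are made up purely for illustration. The helper names are hypothetical, and the tail probability uses the closed form for a $\chi^2$ variable with 3 degrees of freedom, which applies only to the 4-interval case.

```python
import math

def chi2_homogeneity(event_times, T, bins=4):
    """Pearson chi-square statistic for equal-probability intervals on [0, T]."""
    observed = [0] * bins
    for t in event_times:
        observed[min(int(bins * t / T), bins - 1)] += 1
    expected = len(event_times) / bins
    return sum((o - expected) ** 2 / expected for o in observed)

def chi2_sf_3df(x):
    """Upper tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Made-up event times on [0, 8]: most events crowd into the first quarter.
times = [0.09 * i for i in range(1, 21)] + [2.5, 3.5, 4.5, 5.5, 6.5, 7.5]
stat = chi2_homogeneity(times, T=8.0)
p_value = chi2_sf_3df(stat)
print(stat, p_value)
```

A large statistic and a tiny p-value, as with these crowded event times, are evidence against a constant intensity.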



23.4 Bibliographic Remarks 

This is standard material and there are many good references including Grim- 
mett and Stirzaker (1982), Taylor and Karlin (1994), Guttorp (1995), and 
Ross (2002). The following exercises are from those texts. 
23.5 Exercises

1. Let $X_0, X_1, \ldots$ be a Markov chain with states $\{0, 1, 2\}$ and transition
matrix

$$P = \begin{bmatrix} 0.1 & 0.2 & 0.7 \\ 0.9 & 0.1 & 0.0 \\ 0.1 & 0.8 & 0.1 \end{bmatrix}.$$

Assume that $\mu_0 = (0.3, 0.4, 0.3)$. Find $\mathbb{P}(X_0 = 0, X_1 = 1, X_2 = 2)$ and
$\mathbb{P}(X_0 = 0, X_1 = 1, X_2 = 1)$.

2. Let $Y_1, Y_2, \ldots$ be a sequence of iid observations such that $\mathbb{P}(Y = 0) = 0.1$,
$\mathbb{P}(Y = 1) = 0.3$, $\mathbb{P}(Y = 2) = 0.2$, $\mathbb{P}(Y = 3) = 0.4$. Let $X_0 = 0$ and
let

$$X_n = \max\{Y_1, \ldots, Y_n\}.$$

Show that $X_0, X_1, \ldots$ is a Markov chain and find the transition matrix.

3. Consider a two-state Markov chain with states $\mathcal{X} = \{1, 2\}$ and transition
matrix

$$P = \begin{bmatrix} 1 - a & a \\ b & 1 - b \end{bmatrix}$$

where $0 < a < 1$ and $0 < b < 1$. Prove that

$$\lim_{n \to \infty} P^n = \begin{bmatrix} \dfrac{b}{a+b} & \dfrac{a}{a+b} \\[2ex] \dfrac{b}{a+b} & \dfrac{a}{a+b} \end{bmatrix}.$$

4. Consider the chain from question 3 and set $a = .1$ and $b = .3$. Simulate
the chain. Let

$$\hat{p}_n(1) = \frac{1}{n} \sum_{i=1}^{n} I(X_i = 1), \qquad \hat{p}_n(2) = \frac{1}{n} \sum_{i=1}^{n} I(X_i = 2)$$

be the proportion of times the chain is in state 1 and state 2. Plot $\hat{p}_n(1)$
and $\hat{p}_n(2)$ versus $n$ and verify that they converge to the values predicted
from the answer in the previous question.

5. An important Markov chain is the branching process which is used in
biology, genetics, nuclear physics, and many other fields. Suppose that
an animal has $Y$ children. Let $p_k = \mathbb{P}(Y = k)$. Hence, $p_k \geq 0$ for all
$k$ and $\sum_{k=0}^{\infty} p_k = 1$. Assume each animal has the same lifespan and
that they produce offspring according to the distribution $p_k$. Let $X_n$ be
the number of animals in the $n$th generation. Let $Y_1^{(n)}, \ldots, Y_{X_n}^{(n)}$ be the
offspring produced in the $n$th generation. Note that

$$X_{n+1} = Y_1^{(n)} + \cdots + Y_{X_n}^{(n)}.$$

Let $\mu = \mathbb{E}(Y)$ and $\sigma^2 = \mathbb{V}(Y)$. Assume throughout this question that
$X_0 = 1$. Let $M(n) = \mathbb{E}(X_n)$ and $V(n) = \mathbb{V}(X_n)$.

(a) Show that $M(n + 1) = \mu M(n)$ and $V(n + 1) = \sigma^2 M(n) + \mu^2 V(n)$.

(b) Show that $M(n) = \mu^n$ and that $V(n) = \sigma^2 \mu^{n-1}(1 + \mu + \cdots + \mu^{n-1})$.

(c) What happens to the variance if $\mu > 1$? What happens to the variance
if $\mu = 1$? What happens to the variance if $\mu < 1$?

(d) The population goes extinct if $X_n = 0$ for some $n$. Let us thus define
the extinction time $N$ by

$$N = \min\{n : X_n = 0\}.$$

Let $F(n) = \mathbb{P}(N \leq n)$ be the CDF of the random variable $N$. Show that

$$F(n) = \sum_{k=0}^{\infty} p_k \left(F(n-1)\right)^k, \qquad n = 1, 2, \ldots$$

Hint: Note that the event $\{N \leq n\}$ is the same as the event $\{X_n = 0\}$.
Thus, $\mathbb{P}(\{N \leq n\}) = \mathbb{P}(\{X_n = 0\})$. Let $k$ be the number of offspring
of the original parent. The population becomes extinct at time $n$ if and
only if each of the $k$ sub-populations generated from the $k$ offspring goes
extinct in $n - 1$ generations.

(e) Suppose that $p_0 = 1/4$, $p_1 = 1/2$, $p_2 = 1/4$. Use the formula from
(5d) to compute the CDF $F(n)$.



6. Let

$$P = \begin{bmatrix} 0.40 & 0.50 & 0.10 \\ 0.05 & 0.70 & 0.25 \\ 0.05 & 0.50 & 0.45 \end{bmatrix}.$$

Find the stationary distribution $\pi$.



7. Show that if $i$ is a recurrent state and $i \leftrightarrow j$, then $j$ is a recurrent state.
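8. Let

$$P = \begin{bmatrix}
1/3 & 0 & 1/3 & 0 & 0 & 1/3 \\
1/2 & 1/4 & 1/4 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
\cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}$$

(the fourth row is illegible in this copy). Which states are transient? Which states are recurrent?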



9. Let

$$P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}.$$

Show that $\pi = (1/2, 1/2)$ is a stationary distribution. Does this chain
converge? Why/why not?



10. Let 0 < p < 1 and q = 1 — p. Let 



P = 



q p 0 0 0 
q 0 p 0 0 
q 0 0 p 0 
q 0 0 0 p 
1 0 0 0 0 



Find the limiting distribution of the chain. 



11. Let $X(t)$ be an inhomogeneous Poisson process with intensity function
$\lambda(t) > 0$. Let $\Lambda(t) = \int_0^t \lambda(u)\,du$. Define $Y(s) = X(t)$ where $s = \Lambda(t)$.
Show that $Y(s)$ is a homogeneous Poisson process with intensity $\lambda = 1$.

12. Let $X(t)$ be a Poisson process with intensity $\lambda$. Find the conditional
distribution of $X(t)$ given that $X(t + s) = n$.

13. Let $X(t)$ be a Poisson process with intensity $\lambda$. Find the probability
that $X(t)$ is odd, i.e. $\mathbb{P}(X(t) = 1, 3, 5, \ldots)$.



14. Suppose that the arrival of people logging in to the University computer system is
described by a Poisson process $X(t)$ with intensity $\lambda$. Assume that a
person stays logged in for some random time with CDF $G$. Assume these
times are all independent. Let $Y(t)$ be the number of people on the
system at time $t$. Find the distribution of $Y(t)$.

15. Let $X(t)$ be a Poisson process with intensity $\lambda$. Let $W_1, W_2, \ldots$ be the
waiting times. Let $f$ be an arbitrary function. Show that

$$\mathbb{E}\left(\sum_{i=1}^{X(t)} f(W_i)\right) = \lambda \int_0^t f(w)\,dw.$$
16. A two-dimensional Poisson point process is a process of random points
on the plane such that (i) for any set $A$, the number of points falling
in $A$ is Poisson with mean $\lambda\mu(A)$ where $\mu(A)$ is the area of $A$, (ii) the
number of events in non-overlapping regions is independent. Consider
an arbitrary point $x_0$ in the plane. Let $X$ denote the distance from $x_0$
to the nearest random point. Show that

$$\mathbb{P}(X > t) = e^{-\lambda \pi t^2}$$

and

$$\mathbb{E}(X) = \frac{1}{2\sqrt{\lambda}}.$$




24 

Simulation Methods 



In this chapter we will show how simulation can be used to approximate integrals.
Our leading example is the problem of computing integrals in Bayesian
inference but the techniques are widely applicable. We will look at three
integration methods: (i) basic Monte Carlo integration, (ii) importance sampling,
and (iii) Markov chain Monte Carlo (MCMC).
$\theta = (\theta_1, \ldots, \theta_k)$ is multidimensional, then we might be interested in the
posterior for one of the components, $\theta_1$, say. This marginal posterior density
is

$$f(\theta_1 \mid x^n) = \int\!\!\int \cdots \int f(\theta_1, \ldots, \theta_k \mid x^n)\, d\theta_2 \cdots d\theta_k$$

which involves high-dimensional integration.

When $\theta$ is high-dimensional, it may not be feasible to calculate these
integrals analytically. Simulation methods will often be helpful.



24.2 Basic Monte Carlo Integration 

Suppose we want to evaluate the integral

$$I = \int_a^b h(x)\, dx$$

for some function $h$. If $h$ is an "easy" function like a polynomial or trigonometric
function, then we can do the integral in closed form. If $h$ is complicated
there may be no known closed form expression for $I$. There are many numerical
techniques for evaluating $I$ such as Simpson's rule, the trapezoidal rule
and Gaussian quadrature. Monte Carlo integration is another approach for
approximating $I$ which is notable for its simplicity, generality and scalability.
Let us begin by writing

$$I = \int_a^b h(x)\,dx = \int_a^b w(x) f(x)\,dx \tag{24.1}$$

where $w(x) = h(x)(b - a)$ and $f(x) = 1/(b - a)$. Notice that $f$ is the probability
density for a uniform random variable over $(a, b)$. Hence,

$$I = \mathbb{E}_f(w(X))$$

where $X \sim \text{Unif}(a, b)$. If we generate $X_1, \ldots, X_N \sim \text{Unif}(a, b)$, then by the
law of large numbers

$$\hat{I} \equiv \frac{1}{N} \sum_{i=1}^{N} w(X_i) \xrightarrow{P} \mathbb{E}(w(X)) = I. \tag{24.2}$$

This is the basic Monte Carlo integration method. We can also compute
the standard error of the estimate

$$\widehat{\text{se}} = \frac{s}{\sqrt{N}}$$

where

$$s^2 = \frac{\sum_{i=1}^{N} (Y_i - \hat{I})^2}{N - 1}$$

where $Y_i = w(X_i)$. A $1 - \alpha$ confidence interval for $I$ is $\hat{I} \pm z_{\alpha/2}\, \widehat{\text{se}}$. We can take
$N$ as large as we want and hence make the length of the confidence interval
very small.

24.1 Example. Let $h(x) = x^3$. Then, $I = \int_0^1 x^3\,dx = 1/4$. Based on $N = 10{,}000$
observations from a Uniform(0, 1) we get $\hat{I} = .248$ with a standard
error of .0028. ■
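
The example can be reproduced along the following lines. This is a sketch under stated assumptions: the function name and seed are illustrative, and the exact digits depend on the random number generator, so they need not match the values quoted above.

```python
import math
import random

def mc_integrate(h, a, b, n, rng):
    """Basic Monte Carlo estimate of the integral of h over (a, b), with its standard error."""
    ys = [h(rng.uniform(a, b)) * (b - a) for _ in range(n)]  # Y_i = w(X_i)
    estimate = sum(ys) / n
    s2 = sum((y - estimate) ** 2 for y in ys) / (n - 1)      # sample variance of the Y_i
    return estimate, math.sqrt(s2 / n)

rng = random.Random(1)  # fixed seed (an arbitrary choice)
estimate, se = mc_integrate(lambda x: x ** 3, 0.0, 1.0, 10_000, rng)
print(estimate, se)
```

The standard error shrinks like $1/\sqrt{N}$, so each additional decimal digit of accuracy costs roughly 100 times as many draws.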

A generalization of the basic method is to consider integrals of the form

$$I = \int h(x) f(x)\,dx \tag{24.3}$$

where $f(x)$ is a probability density function. Taking $f$ to be a Uniform$(a, b)$
gives us the special case above. Now we draw $X_1, \ldots, X_N \sim f$ and take

$$\hat{I} \equiv \frac{1}{N} \sum_{i=1}^{N} h(X_i)$$

as before.

24.2 Example. Let

$$f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$$

be the standard Normal PDF. Suppose we want to compute the CDF at some
point $x$:

$$I = \int_{-\infty}^{x} f(s)\,ds \equiv \Phi(x).$$

Write

$$I = \int h(s) f(s)\,ds$$

where

$$h(s) = \begin{cases} 1 & s < x \\ 0 & s \geq x. \end{cases}$$

Now we generate $X_1, \ldots, X_N \sim N(0, 1)$ and set

$$\hat{I} = \frac{1}{N} \sum_{i} h(X_i) = \frac{\text{number of observations} \leq x}{N}.$$

For example, with $x = 2$, the true answer is $\Phi(2) = .9772$ and the Monte
Carlo estimate with $N = 10{,}000$ yields .9751. Using $N = 100{,}000$ we get
.9771. ■
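
A minimal sketch of this estimate follows. The seed is an arbitrary choice, and `math.erf` is used only to supply the exact value of $\Phi(2)$ for comparison.

```python
import math
import random

rng = random.Random(2)  # fixed seed (an arbitrary choice)
n, x = 100_000, 2.0
# Fraction of N(0, 1) draws that fall at or below x.
hits = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) <= x)
estimate = hits / n
exact = 0.5 * (1 + math.erf(x / math.sqrt(2)))  # Phi(x) via the error function
print(estimate, exact)
```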
24.3 Example (Bayesian Inference for Two Binomials). Let $X \sim \text{Binomial}(n, p_1)$
and $Y \sim \text{Binomial}(m, p_2)$. We would like to estimate $\delta = p_2 - p_1$. The MLE
is $\hat{\delta} = \hat{p}_2 - \hat{p}_1 = (Y/m) - (X/n)$. We can get the standard error $\widehat{\text{se}}$ using the
delta method which yields

$$\widehat{\text{se}} = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n} + \frac{\hat{p}_2(1 - \hat{p}_2)}{m}}$$

and then construct a 95 percent confidence interval $\hat{\delta} \pm 2\,\widehat{\text{se}}$. Now consider a
Bayesian analysis. Suppose we use the prior $f(p_1, p_2) = f(p_1) f(p_2) = 1$, that
is, a flat prior on $(p_1, p_2)$. The posterior is

$$f(p_1, p_2 \mid X, Y) \propto p_1^X (1 - p_1)^{n - X}\, p_2^Y (1 - p_2)^{m - Y}.$$

The posterior mean of $\delta$ is

$$\bar{\delta} = \int_0^1\!\!\int_0^1 \delta(p_1, p_2) f(p_1, p_2 \mid X, Y) = \int_0^1\!\!\int_0^1 (p_2 - p_1) f(p_1, p_2 \mid X, Y).$$

If we want the posterior density of $\delta$ we can first get the posterior CDF

$$F(c \mid X, Y) = \mathbb{P}(\delta \leq c \mid X, Y) = \int_A f(p_1, p_2 \mid X, Y)$$

where $A = \{(p_1, p_2) : p_2 - p_1 \leq c\}$. The density can then be obtained by
differentiating $F$.

To avoid all these integrals, let's use simulation. Note that $f(p_1, p_2 \mid X, Y) =
f(p_1 \mid X) f(p_2 \mid Y)$ which implies that $p_1$ and $p_2$ are independent under the
posterior distribution. Also, we see that $p_1 \mid X \sim \text{Beta}(X + 1, n - X + 1)$ and
$p_2 \mid Y \sim \text{Beta}(Y + 1, m - Y + 1)$. Hence, we can simulate
$(P_1^{(1)}, P_2^{(1)}), \ldots, (P_1^{(N)}, P_2^{(N)})$ from the posterior by drawing

$$P_1^{(i)} \sim \text{Beta}(X + 1, n - X + 1)$$

$$P_2^{(i)} \sim \text{Beta}(Y + 1, m - Y + 1)$$

for $i = 1, \ldots, N$. Now let $\delta^{(i)} = P_2^{(i)} - P_1^{(i)}$. Then,

$$\bar{\delta} \approx \frac{1}{N} \sum_{i} \delta^{(i)}.$$

We can also get a 95 percent posterior interval for $\delta$ by sorting the simulated
values, and finding the .025 and .975 quantile. The posterior density $f(\delta \mid X, Y)$
can be obtained by applying density estimation techniques to $\delta^{(1)}, \ldots, \delta^{(N)}$
or, simply, by plotting a histogram. For example, suppose that $n = m = 10$,
$X = 8$ and $Y = 6$. From a posterior sample of size 1000 we get a 95 percent
posterior interval of (-0.52, 0.20). The posterior density can be estimated from
a histogram of the simulated values as shown in Figure 24.1. ■

FIGURE 24.1. Posterior of $\delta$ from simulation.
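
The simulation in this example needs only a Beta random number generator. The sketch below uses `random.betavariate` from the Python standard library. The sample size and seed are assumptions chosen for illustration, so the interval endpoints will differ slightly from those quoted above.

```python
import random

rng = random.Random(3)  # fixed seed (an arbitrary choice)
n = m = 10
X, Y = 8, 6
N = 10_000

# delta^(i) = P2^(i) - P1^(i), with independent Beta posterior draws.
deltas = sorted(
    rng.betavariate(Y + 1, m - Y + 1) - rng.betavariate(X + 1, n - X + 1)
    for _ in range(N)
)
post_mean = sum(deltas) / N
lower, upper = deltas[int(0.025 * N)], deltas[int(0.975 * N)]  # .025 and .975 quantiles
print(post_mean, (lower, upper))
```

The exact posterior mean is $7/12 - 9/12 = -1/6$, which gives a check on the simulated value.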

24.4 Example (Bayesian Inference for Dose Response). Suppose we conduct an
experiment by giving rats one of ten possible doses of a drug, denoted by
$x_1 < x_2 < \cdots < x_{10}$. For each dose level $x_i$ we use $n$ rats and we observe
$Y_i$, the number that survive. Thus we have ten independent binomials $Y_i \sim
\text{Binomial}(n, p_i)$. Suppose we know from biological considerations that higher
doses should have higher probability of death. Thus, $p_1 < p_2 < \cdots < p_{10}$. We
want to estimate the dose at which the animals have a 50 percent chance of
dying. This is called the LD50. Formally, $\delta = x_j$ where

$$j = \min\{i : p_i > .50\}.$$

Notice that $\delta$ is implicitly a (complicated) function of $p_1, \ldots, p_{10}$ so we can
write $\delta = g(p_1, \ldots, p_{10})$ for some $g$. This just means that if we know $(p_1, \ldots, p_{10})$
then we can find $\delta$. The posterior mean of $\delta$ is

$$\int_A g(p_1, \ldots, p_{10})\, f(p_1, \ldots, p_{10} \mid Y_1, \ldots, Y_{10})\, dp_1\, dp_2 \cdots dp_{10}.$$

The integral is over the region

$$A = \{(p_1, \ldots, p_{10}) : p_1 < \cdots < p_{10}\}.$$

The posterior CDF of $\delta$ is

$$F(c \mid Y_1, \ldots, Y_{10}) = \mathbb{P}(\delta \leq c \mid Y_1, \ldots, Y_{10})
= \int_B f(p_1, \ldots, p_{10} \mid Y_1, \ldots, Y_{10})\, dp_1\, dp_2 \cdots dp_{10}$$

where

$$B = A \cap \{(p_1, \ldots, p_{10}) : g(p_1, \ldots, p_{10}) \leq c\}.$$

We need to do a 10-dimensional integral over a restricted region $A$. Instead,
we will use simulation. Let us take a flat prior truncated over $A$. Except for
the truncation, each $P_i$ has once again a Beta distribution. To draw from the
posterior we do the following steps:

(1) Draw $P_i \sim \text{Beta}(Y_i + 1, n - Y_i + 1)$, $i = 1, \ldots, 10$.

(2) If $P_1 < P_2 < \cdots < P_{10}$ keep this draw. Otherwise, throw it away and
draw again until you get one you can keep.

(3) Let $\delta = x_j$ where

$$j = \min\{i : P_i > .50\}.$$

We repeat this $N$ times to get $\delta^{(1)}, \ldots, \delta^{(N)}$ and take

$$\mathbb{E}(\delta \mid Y_1, \ldots, Y_{10}) \approx \frac{1}{N} \sum_{i} \delta^{(i)}.$$

$\delta$ is a discrete variable. We can estimate its probability mass function by

$$\mathbb{P}(\delta = x_j \mid Y_1, \ldots, Y_{10}) \approx \frac{1}{N} \sum_{i=1}^{N} I(\delta^{(i)} = x_j).$$


For example, consider the following data:

Dose                        1   2   3   4   5   6   7   8   9  10
Number of animals n_i      15  15  15  15  15  15  15  15  15  15
Number of survivors Y_i     0   0   2   2   8  10  12  14  15  14

The posterior draws for $p_1, \ldots, p_{10}$ are shown in the second panel in the
figure. We find that $\bar{\delta} = 4.04$ with a 95 percent interval of (3, 5). ■



24.3 Importance Sampling 

Consider again the integral $I = \int h(x) f(x)\,dx$ where $f$ is a probability density.
The basic Monte Carlo method involves sampling from $f$. However, there
are cases where we may not know how to sample from $f$. For example, in
Bayesian inference, the posterior density is obtained by multiplying
the likelihood $\mathcal{L}(\theta)$ times the prior $f(\theta)$. There is no guarantee that $f(\theta \mid x)$
will be a known distribution like a Normal or Gamma or whatever.
Importance sampling is a generalization of basic Monte Carlo which
overcomes this problem. Let $g$ be a probability density that we know how to
simulate from. Then

$$I = \int h(x) f(x)\,dx = \int \frac{h(x) f(x)}{g(x)}\, g(x)\,dx = \mathbb{E}_g(Y) \tag{24.4}$$

where $Y = h(X) f(X)/g(X)$ and the expectation $\mathbb{E}_g(Y)$ is with respect to $g$.
We can simulate $X_1, \ldots, X_N \sim g$ and estimate $I$ by

$$\hat{I} = \frac{1}{N} \sum_{i} Y_i = \frac{1}{N} \sum_{i} \frac{h(X_i) f(X_i)}{g(X_i)}. \tag{24.5}$$

This is called importance sampling. By the law of large numbers, $\hat{I} \xrightarrow{P} I$.
However, there is a catch. It's possible that $\hat{I}$ might have an infinite standard
error. To see why, recall that $I$ is the mean of $w(x) = h(x) f(x)/g(x)$. The
second moment of this quantity is

$$\mathbb{E}_g(w^2(X)) = \int \left(\frac{h(x) f(x)}{g(x)}\right)^2 g(x)\,dx = \int \frac{h^2(x) f^2(x)}{g(x)}\,dx. \tag{24.6}$$

If $g$ has thinner tails than $f$, then this integral might be infinite. To avoid this,
a basic rule in importance sampling is to sample from a density $g$ with thicker
tails than $f$. Also, suppose that $g(x)$ is small over some set $A$ where $f(x)$ is
large. Again, the ratio $f/g$ could be large, leading to a large variance. This
implies that we should choose $g$ to be similar in shape to $f$. In summary, a
good choice for an importance sampling density $g$ should be similar to $f$ but
with thicker tails. In fact, we can say what the optimal choice of $g$ is.



24.5 Theorem. The choice of $g$ that minimizes the variance of $\hat{I}$ is

$$g^*(x) = \frac{|h(x)| f(x)}{\int |h(s)| f(s)\,ds}.$$

Proof. The variance of $w = fh/g$ is

$$\mathbb{E}_g(w^2) - (\mathbb{E}_g(w))^2 = \int w^2(x) g(x)\,dx - \left(\int w(x) g(x)\,dx\right)^2
= \int \frac{h^2(x) f^2(x)}{g(x)}\,dx - \left(\int h(x) f(x)\,dx\right)^2.$$

The second integral does not depend on $g$, so we only need to minimize the
first integral. From Jensen's inequality (Theorem 4.9) we have

$$\mathbb{E}_g(W^2) \geq \left(\mathbb{E}_g(|W|)\right)^2 = \left(\int |h(x)| f(x)\,dx\right)^2.$$

This establishes a lower bound on $\mathbb{E}_g(W^2)$. However, $\mathbb{E}_{g^*}(W^2)$ equals this
lower bound which proves the claim. ■

This theorem is interesting but it is only of theoretical interest. If we did
not know how to sample from $f$ then it is unlikely that we could sample from
$|h(x)| f(x) / \int |h(s)| f(s)\,ds$. In practice, we simply try to find a thick-tailed
distribution $g$ which is similar to $f|h|$.

24.6 Example (Tail Probability). Let's estimate $I = \mathbb{P}(Z > 3) = .0013$ where
$Z \sim N(0, 1)$. Write $I = \int h(x) f(x)\,dx$ where $f(x)$ is the standard Normal
density and $h(x) = 1$ if $x > 3$, and 0 otherwise. The basic Monte Carlo
estimator is $\hat{I} = N^{-1} \sum_i h(X_i)$ where $X_1, \ldots, X_N \sim N(0, 1)$. Using $N = 100$
we find (from simulating many times) that $\mathbb{E}(\hat{I}) = .0015$ and $\mathbb{V}(\hat{I}) = .0039$.
Notice that most observations are wasted in the sense that most are not near
the right tail. Now we will estimate this with importance sampling taking $g$
to be a Normal(4, 1) density. We draw values from $g$ and the estimate is now
$\hat{I} = N^{-1} \sum_i f(X_i) h(X_i)/g(X_i)$. In this case we find that $\mathbb{E}(\hat{I}) = .0011$ and
$\mathbb{V}(\hat{I}) = .0002$. We have reduced the standard deviation by a factor of 20. ■
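
The comparison in this example can be sketched as follows. The function names, the number of repetitions, and the seed are illustrative assumptions. The size of the reduction observed in any one run depends on these choices and need not equal the factor quoted above.

```python
import math
import random
import statistics

def tail_prob_basic(n, rng, cutoff=3.0):
    """Basic Monte Carlo: fraction of N(0, 1) draws above the cutoff."""
    return sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) > cutoff) / n

def tail_prob_importance(n, rng, shift=4.0, cutoff=3.0):
    """Importance sampling with proposal g = N(shift, 1)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(shift, 1.0)
        if x > cutoff:
            # f(x) / g(x) for f = N(0, 1) and g = N(shift, 1)
            total += math.exp(-0.5 * x * x + 0.5 * (x - shift) ** 2)
    return total / n

rng = random.Random(5)  # fixed seed (an arbitrary choice)
exact = 0.5 * math.erfc(3.0 / math.sqrt(2))
basic = [tail_prob_basic(100, rng) for _ in range(2000)]
imp = [tail_prob_importance(100, rng) for _ in range(2000)]
sd_basic, sd_imp = statistics.stdev(basic), statistics.stdev(imp)
print(exact, statistics.mean(basic), statistics.mean(imp), sd_basic, sd_imp)
```

With $N = 100$, most basic Monte Carlo runs see no draw above 3 at all and return exactly 0, which is why that estimator is so variable.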

24.7 Example (Measurement Model With Outliers). Suppose we have measurements
$X_1, \ldots, X_n$ of some physical quantity $\theta$. A reasonable model is

$$X_i = \theta + \epsilon_i.$$

If we assume that $\epsilon_i \sim N(0, 1)$ then $X_i \sim N(\theta, 1)$. However, when taking
measurements, it is often the case that we get the occasional wild observation,
or outlier. This suggests that a Normal might be a poor model since Normals
have thin tails which implies that extreme observations are rare. One way to
improve the model is to use a density for $\epsilon_i$ with a thicker tail, for example,
a t-distribution with $\nu$ degrees of freedom which has the form

$$t(x) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)} \frac{1}{\sqrt{\nu\pi}} \left(1 + \frac{x^2}{\nu}\right)^{-(\nu + 1)/2}.$$

Smaller values of $\nu$ correspond to thicker tails. For the sake of illustration we
will take $\nu = 3$. Suppose we observe $X_i = \theta + \epsilon_i$, $i = 1, \ldots, n$ where $\epsilon_i$ has
a t distribution with $\nu = 3$. We will take a flat prior on $\theta$. The likelihood is
$\mathcal{L}(\theta) = \prod_{i=1}^{n} t(X_i - \theta)$ and the posterior mean of $\theta$ is

$$\bar{\theta} = \frac{\int \theta \mathcal{L}(\theta)\,d\theta}{\int \mathcal{L}(\theta)\,d\theta}.$$

We can estimate the top and bottom integral using importance sampling. We
draw $\theta_1, \ldots, \theta_N \sim g$ and then

$$\bar{\theta} \approx \frac{\frac{1}{N} \sum_{j=1}^{N} \frac{\theta_j \mathcal{L}(\theta_j)}{g(\theta_j)}}{\frac{1}{N} \sum_{j=1}^{N} \frac{\mathcal{L}(\theta_j)}{g(\theta_j)}}.$$

To illustrate the idea, we drew $n = 2$ observations. The posterior mean (computed
numerically) is -0.54. Using a Normal importance sampler $g$ yields an
estimate of -0.74. Using a Cauchy (t-distribution with 1 degree of freedom)
importance sampler yields an estimate of -0.53. ■



24.4 MCMC Part I: The Metropolis-Hastings Algorithm

Consider once more the problem of estimating the integral $I = \int h(x) f(x)\,dx$.
Now we introduce Markov chain Monte Carlo (MCMC) methods. The idea is
to construct a Markov chain $X_1, X_2, \ldots$ whose stationary distribution is $f$.
Under certain conditions it will then follow that

$$\frac{1}{N} \sum_{i=1}^{N} h(X_i) \xrightarrow{P} \mathbb{E}_f(h(X)) = I.$$

This works because there is a law of large numbers for Markov chains; see
Theorem 23.25.

The Metropolis-Hastings algorithm is a specific MCMC method that
works as follows. Let $q(y \mid x)$ be an arbitrary, friendly distribution (i.e., we
know how to sample from $q(y \mid x)$). The conditional density $q(y \mid x)$ is called
the proposal distribution. The Metropolis-Hastings algorithm creates a
sequence of observations $X_0, X_1, \ldots$ as follows.

Metropolis-Hastings Algorithm

Choose $X_0$ arbitrarily. Suppose we have generated $X_0, X_1, \ldots, X_i$. To
generate $X_{i+1}$ do the following:

(1) Generate a proposal or candidate value $Y \sim q(y \mid X_i)$.

(2) Evaluate $r \equiv r(X_i, Y)$ where

$$r(x, y) = \min\left\{\frac{f(y)}{f(x)} \frac{q(x \mid y)}{q(y \mid x)},\ 1\right\}.$$

(3) Set

$$X_{i+1} = \begin{cases} Y & \text{with probability } r \\ X_i & \text{with probability } 1 - r. \end{cases}$$


24.8 Remark. A simple way to execute step (3) is to generate $U \sim \text{Unif}(0, 1)$. If
$U < r$ set $X_{i+1} = Y$, otherwise set $X_{i+1} = X_i$.

24.9 Remark. A common choice for $q(y \mid x)$ is $N(x, b^2)$ for some $b > 0$. This
means that the proposal is drawn from a Normal, centered at the current
value. In this case, the proposal density $q$ is symmetric, $q(y \mid x) = q(x \mid y)$, and
$r$ simplifies to

$$r = \min\left\{\frac{f(Y)}{f(X_i)},\ 1\right\}.$$

By construction, $X_0, X_1, \ldots$ is a Markov chain. But why does this Markov
chain have $f$ as its stationary distribution? Before we explain why, let us first
do an example.



24.10 Example. The Cauchy distribution has density

$$f(x) = \frac{1}{\pi} \frac{1}{1 + x^2}.$$

Our goal is to simulate a Markov chain whose stationary distribution is $f$.
As suggested in the remark above, we take $q(y \mid x)$ to be a $N(x, b^2)$. So in this
case,

$$r(x, y) = \min\left\{\frac{f(y)}{f(x)},\ 1\right\} = \min\left\{\frac{1 + x^2}{1 + y^2},\ 1\right\}.$$

So the algorithm is to draw $Y \sim N(X_i, b^2)$ and set

$$X_{i+1} = \begin{cases} Y & \text{with probability } r(X_i, Y) \\ X_i & \text{with probability } 1 - r(X_i, Y). \end{cases}$$

The simulator requires a choice of $b$. Figure 24.2 shows three chains of length
$N = 1{,}000$ using $b = .1$, $b = 1$ and $b = 10$. Setting $b = .1$ forces the chain
to take small steps. As a result, the chain doesn't "explore" much of the
sample space. The histogram from the sample does not approximate the true
density very well. Setting $b = 10$ causes the proposals to often be far in the
tails, making $r$ small and hence we reject the proposal and keep the chain
at its current position. The result is that the chain "gets stuck" at the same
place quite often. Again, this means that the histogram from the sample does
not approximate the true density very well. The middle choice avoids these
extremes and results in a Markov chain sample that better represents the
density sooner. In summary, there are tuning parameters and the efficiency
of the chain depends on these parameters. We'll discuss this in more detail
later. ■

FIGURE 24.2. Three Metropolis chains corresponding to $b = .1$, $b = 1$, $b = 10$.
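
The three chains in this example can be generated with a few lines of code. The sketch below is illustrative only: the function name, starting value, and seed are assumptions. It reports the fraction of accepted proposals for each $b$, which summarizes the "small steps" versus "gets stuck" behavior described above.

```python
import random

def metropolis_cauchy(n, b, rng, x0=0.0):
    """Random-walk Metropolis chain targeting the standard Cauchy density."""
    chain, x, accepted = [], x0, 0
    for _ in range(n):
        y = rng.gauss(x, b)                       # proposal Y ~ N(x, b^2)
        r = min((1 + x * x) / (1 + y * y), 1.0)   # f(y)/f(x); the 1/pi cancels
        if rng.random() < r:                      # accept with probability r
            x, accepted = y, accepted + 1
        chain.append(x)
    return chain, accepted / n

rng = random.Random(6)  # fixed seed (an arbitrary choice)
rates = {}
for b in (0.1, 1.0, 10.0):
    chain, rates[b] = metropolis_cauchy(1000, b, rng)
print(rates)
```

A very high acceptance rate (small $b$) and a very low one (large $b$) are both signs of a chain that explores the target slowly.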

If the sample from the Markov chain starts to “look like” the target distri- 
bution / quickly, then we say that the chain is “mixing well.” Constructing a 
chain that mixes well is somewhat of an art. 

Why It Works. Recall from Chapter 23 that a distribution $\pi$ satisfies
detailed balance for a Markov chain if

$$p_{ij}\pi_i = p_{ji}\pi_j.$$

We showed that if $\pi$ satisfies detailed balance, then it is a stationary
distribution for the chain.

Because we are now dealing with continuous state Markov chains, we will
change notation a little and write $p(x, y)$ for the probability of making a
transition from $x$ to $y$. Also, let's use $f(x)$ instead of $\pi$ for a distribution. In
this new notation, $f$ is a stationary distribution if $f(x) = \int f(y) p(y, x)\,dy$ and
detailed balance holds for $f$ if

$$f(x) p(x, y) = f(y) p(y, x). \tag{24.7}$$

Detailed balance implies that $f$ is a stationary distribution since, if detailed
balance holds, then

$$\int f(y) p(y, x)\,dy = \int f(x) p(x, y)\,dy = f(x) \int p(x, y)\,dy = f(x)$$

which shows that $f(x) = \int f(y) p(y, x)\,dy$ as required. Our goal is to show that
$f$ satisfies detailed balance which will imply that $f$ is a stationary distribution
for the chain.

Consider two points $x$ and $y$. Either

$$f(x) q(y \mid x) < f(y) q(x \mid y) \quad \text{or} \quad f(x) q(y \mid x) > f(y) q(x \mid y).$$

We will ignore ties (which occur with probability zero for continuous
distributions). Without loss of generality, assume that $f(x) q(y \mid x) > f(y) q(x \mid y)$. This
implies that

$$r(x, y) = \frac{f(y)}{f(x)} \frac{q(x \mid y)}{q(y \mid x)}$$

and that $r(y, x) = 1$. Now $p(x, y)$ is the probability of jumping from $x$ to $y$.
This requires two things: (i) the proposal distribution must generate $y$, and
(ii) you must accept $y$. Thus,

$$p(x, y) = q(y \mid x)\, r(x, y) = q(y \mid x) \frac{f(y)}{f(x)} \frac{q(x \mid y)}{q(y \mid x)} = \frac{f(y)}{f(x)}\, q(x \mid y).$$

Therefore,

$$f(x) p(x, y) = f(y) q(x \mid y). \tag{24.8}$$

On the other hand, $p(y, x)$ is the probability of jumping from $y$ to $x$. This
requires two things: (i) the proposal distribution must generate $x$, and (ii) you
must accept $x$. This occurs with probability $p(y, x) = q(x \mid y)\, r(y, x) = q(x \mid y)$.
Hence,

$$f(y) p(y, x) = f(y) q(x \mid y). \tag{24.9}$$

Comparing (24.8) and (24.9), we see that we have shown that detailed balance
holds.
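
The argument can be checked numerically on a small discrete state space, where the integrals become sums. In the sketch below the target `f` and the asymmetric proposal matrix `q` are arbitrary made-up numbers. The code builds the Metropolis-Hastings transition matrix and measures how far it is from satisfying detailed balance and stationarity.

```python
# Target distribution on states {0, 1, 2, 3} (arbitrary illustrative values).
f = [0.1, 0.2, 0.3, 0.4]
# Proposal matrix: q[x][y] = q(y | x); rows sum to 1 and q is not symmetric.
q = [[0.25, 0.25, 0.25, 0.25],
     [0.40, 0.10, 0.30, 0.20],
     [0.10, 0.50, 0.20, 0.20],
     [0.30, 0.30, 0.30, 0.10]]
k = len(f)

# Metropolis-Hastings transition probabilities p[x][y] = q(y|x) r(x, y) for y != x.
p = [[0.0] * k for _ in range(k)]
for x in range(k):
    for y in range(k):
        if y != x:
            r = min(f[y] * q[y][x] / (f[x] * q[x][y]), 1.0)
            p[x][y] = q[x][y] * r
    p[x][x] = 1.0 - sum(p[x])  # rejected proposals leave the chain at x

# Largest violation of f(x) p(x, y) = f(y) p(y, x), and the vector sum_y f(y) p(y, x).
max_gap = max(abs(f[x] * p[x][y] - f[y] * p[y][x]) for x in range(k) for y in range(k))
stationary = [sum(f[y] * p[y][x] for y in range(k)) for x in range(k)]
print(max_gap, stationary)
```

Up to floating-point rounding, the gap is zero and the second printed vector reproduces `f`, as (24.7) predicts.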
There are different types of MCMC algorithms. Here we will consider a few of
the most popular versions.

Random-Walk-Metropolis-Hastings. In the previous section we
considered drawing a proposal $Y$ of the form

$$Y = X_i + \epsilon_i$$

where $\epsilon_i$ comes from some distribution with density $g$. In other words, $q(y \mid x) =
g(y - x)$. We saw that in this case,

$$r(x, y) = \min\left\{1,\ \frac{f(y)}{f(x)}\right\}.$$

This is called a random-walk-Metropolis-Hastings method. The reason
for the name is that, if we did not do the accept-reject step, we would be
simulating a random walk. The most common choice for $g$ is a $N(0, b^2)$. The
hard part is choosing $b$ so that the chain mixes well. A good rule of thumb is:
choose $b$ so that you accept the proposals about 50 percent of the time.

Warning! This method doesn't make sense unless $X$ takes values on the
whole real line. If $X$ is restricted to some interval then it is best to transform
$X$. For example, if $X \in (0, \infty)$ then you might take $Y = \log X$ and then
simulate the distribution for $Y$ instead of $X$.

Independence-Metropolis-Hastings. This is an importance-sampling
version of MCMC. We draw the proposal from a fixed distribution $g$.
Generally, $g$ is chosen to be an approximation to $f$. The acceptance probability
becomes

$$r(x, y) = \min\left\{1,\ \frac{f(y)}{f(x)} \frac{g(x)}{g(y)}\right\}.$$


Gibbs Sampling. The two previous methods can be easily adapted, in
principle, to work in higher dimensions. In practice, tuning the chains to make
them mix well is hard. Gibbs sampling is a way to turn a high-dimensional
problem into several one-dimensional problems.

Here's how it works for a bivariate problem. Suppose that $(X, Y)$ has
density $f_{X,Y}(x, y)$. First, suppose that it is possible to simulate from the
conditional distributions $f_{X \mid Y}(x \mid y)$ and $f_{Y \mid X}(y \mid x)$. Let $(X_0, Y_0)$ be starting values.
Assume we have drawn $(X_0, Y_0), \ldots, (X_n, Y_n)$. Then the Gibbs sampling
algorithm for getting $(X_{n+1}, Y_{n+1})$ is:

Gibbs Sampling

$$X_{n+1} \sim f_{X \mid Y}(x \mid Y_n)$$
$$Y_{n+1} \sim f_{Y \mid X}(y \mid X_{n+1})$$

repeat

This generalizes in the obvious way to higher dimensions.
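
As a concrete illustration, consider a standard bivariate Normal with correlation $\rho$, for which both conditionals are Normal: $X \mid Y = y \sim N(\rho y, 1 - \rho^2)$ and $Y \mid X = x \sim N(\rho x, 1 - \rho^2)$. This target is an assumption chosen for illustration and does not appear in the text. The sketch alternates between the two conditionals and checks that the sample correlation is near $\rho$.

```python
import math
import random

def gibbs_bivariate_normal(n, rho, rng, x0=0.0, y0=0.0):
    """Gibbs sampler for a standard bivariate Normal with correlation rho."""
    sd = math.sqrt(1 - rho * rho)
    x, y, draws = x0, y0, []
    for _ in range(n):
        x = rng.gauss(rho * y, sd)   # X_{n+1} ~ f(x | Y_n)
        y = rng.gauss(rho * x, sd)   # Y_{n+1} ~ f(y | X_{n+1})
        draws.append((x, y))
    return draws

rng = random.Random(7)  # fixed seed (an arbitrary choice)
rho = 0.8
draws = gibbs_bivariate_normal(20_000, rho, rng)
n = len(draws)
mx = sum(x for x, _ in draws) / n
my = sum(y for _, y in draws) / n
sxy = sum((x - mx) * (y - my) for x, y in draws)
sxx = sum((x - mx) ** 2 for x, _ in draws)
syy = sum((y - my) ** 2 for _, y in draws)
corr = sxy / math.sqrt(sxx * syy)
print(corr)
```

Note that the second draw conditions on the newly updated $X_{n+1}$, not on $X_n$.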

24.11 Example (Normal Hierarchical Model). Gibbs sampling is very useful
for a class of models called hierarchical models. Here is a simple case.
Suppose we draw a sample of $k$ cities. From each city we draw $n_i$ people and
observe how many people $Y_i$ have a disease. Thus, $Y_i \sim \text{Binomial}(n_i, p_i)$. We
are allowing for different disease rates in different cities. We can also think of
the $p_i$'s as random draws from some distribution $F$. We can write this model
in the following way:

$$P_i \sim F$$
$$Y_i \mid P_i = p_i \sim \text{Binomial}(n_i, p_i).$$

We are interested in estimating the $p_i$'s and the overall disease rate $\int p\,dF(p)$.

To proceed, it will simplify matters if we make some transformations that
allow us to use some Normal approximations. Let $\hat{p}_i = Y_i/n_i$. Recall that
$\hat{p}_i \approx N(p_i, s_i)$ where $s_i = \sqrt{\hat{p}_i(1 - \hat{p}_i)/n_i}$. Let $\psi_i = \log(p_i/(1 - p_i))$ and
define $Z_i \equiv \hat{\psi}_i = \log(\hat{p}_i/(1 - \hat{p}_i))$. By the delta method,

$$\hat{\psi}_i \approx N(\psi_i, \sigma_i^2)$$

where $\sigma_i^2 = 1/(n\hat{p}_i(1 - \hat{p}_i))$. Experience shows that the Normal approximation
for $\psi$ is more accurate than the Normal approximation for $p$ so we shall work
with $\psi$. We shall treat $\sigma_i$ as known. Furthermore, we shall take the distribution
of the $\psi_i$'s to be Normal. The hierarchical model is now

$$\psi_i \sim N(\mu, \tau^2)$$
$$Z_i \mid \psi_i \sim N(\psi_i, \sigma_i^2).$$

As yet another simplification we take $\tau = 1$. The unknown parameters are
$\theta = (\mu, \psi_1, \ldots, \psi_k)$. The likelihood function is

$$\mathcal{L}(\theta) \propto \prod_i f(\psi_i \mid \mu) \prod_i f(Z_i \mid \psi_i)
\propto \prod_i \exp\left\{-\frac{1}{2}(\psi_i - \mu)^2\right\} \exp\left\{-\frac{1}{2\sigma_i^2}(Z_i - \psi_i)^2\right\}.$$
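If we use the prior $f(\mu) \propto 1$ then the posterior is proportional to the likelihood.
To use Gibbs sampling, we need to find the conditional distribution of each
parameter conditional on all the others. Let us begin by finding $f(\mu \mid \text{rest})$
where "rest" refers to all the other variables. We can throw away any terms
that don't involve $\mu$. Thus,

$$f(\mu \mid \text{rest}) \propto \prod_i \exp\left\{-\frac{1}{2}(\psi_i - \mu)^2\right\} \propto \exp\left\{-\frac{k}{2}(\mu - b)^2\right\}$$

where

$$b = \frac{1}{k} \sum_i \psi_i.$$

Hence we see that $\mu \mid \text{rest} \sim N(b, 1/k)$. Next we will find $f(\psi_i \mid \text{rest})$. Again, we
can throw away any terms not involving $\psi_i$, leaving us with

$$f(\psi_i \mid \text{rest}) \propto \exp\left\{-\frac{1}{2}(\psi_i - \mu)^2\right\} \exp\left\{-\frac{1}{2\sigma_i^2}(Z_i - \psi_i)^2\right\}
\propto \exp\left\{-\frac{1}{2d_i^2}(\psi_i - e_i)^2\right\}$$

where

$$e_i = \frac{\dfrac{Z_i}{\sigma_i^2} + \mu}{1 + \dfrac{1}{\sigma_i^2}} \quad \text{and} \quad d_i^2 = \frac{1}{1 + \dfrac{1}{\sigma_i^2}}$$

and so $\psi_i \mid \text{rest} \sim N(e_i, d_i^2)$. The Gibbs sampling algorithm then involves
iterating the following steps $N$ times:

draw $\mu \sim N(b, v^2)$, where $v^2 = 1/k$
draw $\psi_1 \sim N(e_1, d_1^2)$
...
draw $\psi_k \sim N(e_k, d_k^2)$

It is understood that at each step, the most recently drawn version of each
variable is used.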

We generated a numerical example with $k = 20$ cities and $n = 20$ people
from each city. After running the chain, we can convert each $\psi_i$ back into $p_i$
by way of $p_i = e^{\psi_i}/(1 + e^{\psi_i})$. The raw proportions are shown in Figure 24.4.
Figure 24.3 shows "trace plots" of the Markov chain for $p_1$ and $\mu$. Figure
24.4 shows the posterior for $\mu$ based on the simulated values. The second
panel of Figure 24.4 shows the raw proportions and the Bayes estimates. Note
that the Bayes estimates are "shrunk" together. The parameter $\tau$ controls
the amount of shrinkage. We set $\tau = 1$ but, in practice, we should treat $\tau$ as
another unknown parameter and let the data determine how much shrinkage
is needed. ■
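
The Gibbs steps for this model can be sketched as follows. The data behind the figures are not given in the text, so the counts `y` below are made up for illustration, and the function name, chain length, and seed are assumptions. The sketch also reports the logistic transform of the posterior mean of each $\psi_i$, which is one convenient summary and not the only possible Bayes estimate.

```python
import math
import random

def gibbs_hierarchical(z, sigma2, n_iter, rng):
    """Gibbs sampler for psi_i ~ N(mu, 1), Z_i | psi_i ~ N(psi_i, sigma2_i), flat prior on mu."""
    k = len(z)
    psi = list(z)                                  # start at the observed logits
    mu_draws, psi_sums = [], [0.0] * k
    for _ in range(n_iter):
        b = sum(psi) / k
        mu = rng.gauss(b, math.sqrt(1.0 / k))      # mu | rest ~ N(b, 1/k)
        for i in range(k):
            prec = 1.0 + 1.0 / sigma2[i]           # 1 / d_i^2
            e_i = (z[i] / sigma2[i] + mu) / prec
            psi[i] = rng.gauss(e_i, math.sqrt(1.0 / prec))  # psi_i | rest ~ N(e_i, d_i^2)
            psi_sums[i] += psi[i]
        mu_draws.append(mu)
    return mu_draws, [s / n_iter for s in psi_sums]

# Made-up counts for k = 20 cities with n = 20 people each (none equal to 0 or n).
n = 20
y = [3, 5, 8, 10, 12, 15, 17, 4, 6, 9, 11, 13, 7, 10, 14, 16, 2, 18, 9, 11]
p_hat = [yi / n for yi in y]
z = [math.log(p / (1 - p)) for p in p_hat]
sigma2 = [1.0 / (n * p * (1 - p)) for p in p_hat]

rng = random.Random(8)  # fixed seed (an arbitrary choice)
mu_draws, psi_mean = gibbs_hierarchical(z, sigma2, 5000, rng)
p_bayes = [1 / (1 + math.exp(-s)) for s in psi_mean]  # back-transform posterior mean logits
print(sum(mu_draws) / len(mu_draws))
print(max(p_hat) - min(p_hat), max(p_bayes) - min(p_bayes))
```

The second printed line compares the spread of the raw proportions with the spread of the back-transformed estimates; the latter is smaller, which is the shrinkage effect described in the example.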




FIGURE 24.3. Posterior simulation for Example 24.11. The top panel shows
simulated values of $p_1$. The bottom panel shows simulated values of $\mu$.



So far we assumed that we know how to draw samples from the conditionals
$f_{X \mid Y}(x \mid y)$ and $f_{Y \mid X}(y \mid x)$. If we don't know how, we can still use the Gibbs
sampling algorithm by drawing each observation using a Metropolis-Hastings
step. Let $q$ be a proposal distribution for $x$ and let $\tilde{q}$ be a proposal distribution
for $y$. When we do a Metropolis step for $X$, we treat $Y$ as fixed. Similarly,
when we do a Metropolis step for $Y$, we treat $X$ as fixed. Here are the steps:
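Metropolis within Gibbs

(1a) Draw a proposal $Z \sim q(z \mid X_n)$.

(1b) Evaluate

$$r = \min\left\{\frac{f(Z, Y_n)}{f(X_n, Y_n)} \frac{q(X_n \mid Z)}{q(Z \mid X_n)},\ 1\right\}.$$

(1c) Set

$$X_{n+1} = \begin{cases} Z & \text{with probability } r \\ X_n & \text{with probability } 1 - r. \end{cases}$$

(2a) Draw a proposal $Z \sim \tilde{q}(z \mid Y_n)$.

(2b) Evaluate

$$r = \min\left\{\frac{f(X_{n+1}, Z)}{f(X_{n+1}, Y_n)} \frac{\tilde{q}(Y_n \mid Z)}{\tilde{q}(Z \mid Y_n)},\ 1\right\}.$$

(2c) Set

$$Y_{n+1} = \begin{cases} Z & \text{with probability } r \\ Y_n & \text{with probability } 1 - r. \end{cases}$$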



Again, this generalizes to more than two dimensions. 




FIGURE 24.4. Example 24.11. Top panel: posterior histogram of $\mu$. Lower panel:
raw proportions and the Bayes posterior estimates. The Bayes estimates have been
shrunk closer together than the raw proportions.
24.6 Bibliographic Remarks

MCMC methods go back to the effort to build the atomic bomb in World War
II. They were used in various places after that, especially in spatial statistics.
There was a new surge of interest in the 1990s that still continues. My main
reference for this chapter was Robert and Casella (1999). See also Gelman
et al. (1995) and Gilks et al. (1998).



24.7 Exercises 



1. Let 




(a) Estimate $I$ using the basic Monte Carlo method. Use $N = 100{,}000$.
Also, find the estimated standard error.

(b) Find an (analytical) expression for the standard error of your
estimate in (a). Compare to the estimated standard error.

(c) Estimate $I$ using importance sampling. Take $g$ to be $N(1.5, v^2)$ with
$v = 0.1$, $v = 1$ and $v = 10$. Compute the (true) standard errors in each
case. Also, plot a histogram of the values you are averaging to see if
there are any extreme values.

(d) Find the optimal importance sampling function $g^*$. What is the
standard error using $g^*$?

2. Here is a way to use importance sampling to estimate a marginal density. 
Let $f_{X,Y}(x, y)$ be a bivariate density and let $(X_1, Y_1), \ldots, (X_N, Y_N) \sim f_{X,Y}$.

(a) Let $w(x)$ be an arbitrary probability density function. Let

$$\hat{f}_X(x) = \frac{1}{N} \sum_{i=1}^{N} \frac{f_{X,Y}(x, Y_i)\, w(X_i)}{f_{X,Y}(X_i, Y_i)}.$$

Show that, for each $x$,

$$\hat{f}_X(x) \xrightarrow{p} f_X(x).$$

Find an expression for the variance of this estimator.

(b) Let $Y \sim N(0, 1)$ and $X \mid Y = y \sim N(0, 1 + y^2)$. Use the method in
(a) to estimate $f_X(x)$.
3. Here is a method called accept-reject sampling for drawing observations
from a distribution.

(a) Suppose that $f$ is some probability density function. Let $g$ be any
other density and suppose that $f(x) \le M g(x)$ for all $x$, where $M$ is a
known constant. Consider the following algorithm:

(step 1): Draw $X \sim g$ and $U \sim \text{Unif}(0, 1)$;

(step 2): If $U \le f(X)/(M g(X))$ set $Y = X$, otherwise go back to step
1. (Keep repeating until you finally get an observation.)

Show that the distribution of $Y$ is $f$.

(b) Let $f$ be a standard Normal density and let $g(x) = 1/(\pi(1 + x^2))$ be
the Cauchy density. Apply the method in (a) to draw 1,000 observations
from the Normal distribution. Draw a histogram of the sample to verify
that the sample appears to be Normal.
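
To make the two steps of the algorithm in (a) concrete, here is a minimal Python sketch on a different target from the one in part (b). It assumes a Beta(2,2) density $f(x) = 6x(1-x)$ on $[0,1]$, a Unif(0,1) proposal $g$, and the bound $M = 1.5$ (the maximum of $f$). The function and argument names are hypothetical and chosen for this illustration.

```python
import random


def accept_reject(f, g_sample, g_density, M, n, seed=0):
    """Draw n observations from density f using proposal g with f <= M * g."""
    rng = random.Random(seed)
    out = []
    n_proposals = 0
    while len(out) < n:
        # Step 1: draw X from g and U from Unif(0, 1).
        x = g_sample(rng)
        u = rng.random()
        n_proposals += 1
        # Step 2: keep X if U <= f(X) / (M g(X)); otherwise try again.
        if u <= f(x) / (M * g_density(x)):
            out.append(x)
    return out, n_proposals


def beta22_density(x):
    """Beta(2, 2) density, an assumed target for this sketch."""
    return 6.0 * x * (1.0 - x) if 0.0 <= x <= 1.0 else 0.0


sample, n_proposals = accept_reject(
    beta22_density,
    g_sample=lambda rng: rng.random(),  # Unif(0, 1) proposal draw
    g_density=lambda x: 1.0,            # Unif(0, 1) density on [0, 1]
    M=1.5,
    n=5000,
)
sample_mean = sum(sample) / len(sample)
acceptance_rate = len(sample) / n_proposals
print(round(sample_mean, 2), round(acceptance_rate, 2))
```

The sample mean should be close to 0.5 and the acceptance rate close to $1/M = 2/3$, which is why a small $M$ (a proposal that hugs the target) is desirable.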



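4. A random variable $Z$ has an inverse Gaussian distribution if it has
density

$$f(z) \propto z^{-3/2} \exp\left\{ -\theta_1 z - \frac{\theta_2}{z} + 2\sqrt{\theta_1 \theta_2} + \log\sqrt{2\theta_2} \right\}, \qquad z > 0,$$

where $\theta_1 > 0$ and $\theta_2 > 0$ are parameters. It can be shown that

$$E(Z) = \sqrt{\frac{\theta_2}{\theta_1}} \quad \text{and} \quad E\left(\frac{1}{Z}\right) = \sqrt{\frac{\theta_1}{\theta_2}} + \frac{1}{2\theta_2}.$$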



(a) Let $\theta_1 = 1.5$ and $\theta_2 = 2$. Draw a sample of size 1,000 using the
independence-Metropolis-Hastings method. Use a Gamma distribution
as the proposal density. To assess the accuracy, compare the mean of $Z$
and $1/Z$ from the sample to the theoretical means. Try different Gamma
distributions to see if you can get an accurate sample.

(b) Draw a sample of size 1,000 using the random-walk-Metropolis-Hastings
method. Since $z > 0$ we cannot just use a Normal density.
One strategy is this. Let $W = \log Z$. Find the density of $W$. Use the
random-walk-Metropolis-Hastings method to get a sample $W_1, \ldots, W_N$
and let $Z_i = e^{W_i}$. Assess the accuracy of the simulation as in part (a).

5. Get the heart disease data from the book web site. Consider a Bayesian
analysis of the logistic regression model

$$P(Y = 1 \mid X = x) = \frac{e^{\beta_0 + \sum_{j=1}^{k} \beta_j x_j}}{1 + e^{\beta_0 + \sum_{j=1}^{k} \beta_j x_j}}.$$

(a) Use the flat prior $f(\beta_0, \ldots, \beta_k) \propto 1$. Use the Gibbs-Metropolis algorithm
to draw a sample of size 10,000 from the posterior $f(\beta_0, \beta_1 \mid \text{data})$. Plot
histograms of the posteriors for the $\beta_j$'s. Get the posterior mean and a
95 percent posterior interval for each $\beta_j$.

(b) Compare your analysis to a frequentist approach using maximum
likelihood.
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Statistical Models 




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